ZXFoundation™ Development Guide
Document Revision: 26h1.0
Applies to: ZXFoundation™ release 26h1 and later
Status: Active development
About This Document
This guide is the primary technical reference for the ZXFoundation™ kernel and its associated toolchain. It is written for:
- OS developers who wish to understand the z/Architecture boot and execution environment.
- Kernel contributors who need a precise description of subsystem contracts and initialization order.
- Integrators who want to load their own kernel or module using the ZXFL bootloader.
Familiarity with C23, ELF64, and general operating-system concepts is assumed. Background on IBM z/Architecture is provided in the Architecture chapter.
What Is ZXFoundation™?
ZXFoundation™ is a freestanding, SMP-capable kernel for IBM z/Architecture (s390x) mainframes and emulators. It is written in C23 and built for the s390x-unknown-none-elf target triple.
The project comprises three independently versioned components:
| Component | Output artifact | Description |
|---|---|---|
| ZXFL | core.zxfoundationloader00.sys, core.zxfoundationloader01.sys | Two-stage bootloader |
| Nucleus | core.zxfoundation.nucleus | Kernel ELF64 image |
| Host tools | bin2rec, zxsign | Build-time utilities |
All three are built from a single CMake project using a cross-compiler toolchain targeting s390x.
Version Scheme
Releases follow the scheme YYhN, where YY is the two-digit year and N is the half-year index (1 = first half, 2 = second half). The current release is 26h1.
The boot protocol carries its own version field (ZXFL_VERSION_*). A kernel must check this field and refuse to boot if the version is not one it understands.
Document Organization
| Chapter | Contents |
|---|---|
| Architecture | z/Architecture fundamentals: PSW, DAT, CCW, IPL, paging |
| Bootloader | ZXFL design, stage descriptions, boot protocol |
| Kernel | Subsystem table, initialization sequence, memory management |
| Build System | CMake modules, toolchains, configuration variables |
| Host Tools | bin2rec and zxsign reference |
Quick Start
# Configure with the Clang toolchain (recommended)
cmake -B build -DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-clang.cmake
# Build everything
cmake --build build
# Generate the DASD image and launch Hercules
cmake --build build --target dasd
hercules -f build/hercules.cnf
In the Hercules console, issue ipl 0100 to start the boot sequence.
See Build System for full configuration options and Build Targets for a description of each output artifact.
Architecture Overview
Reference: IBM z/Architecture Principles of Operation, SA22-7832
1. z/Architecture
z/Architecture (s390x) is IBM's 64-bit mainframe instruction set, introduced with the z900 in 2000. It supersedes ESA/390 (31-bit) and System/370 (24-bit). ZXFoundation™ targets z/Architecture exclusively; ESA/390 compatibility mode is used only during the first instruction of the IPL sequence.
Key properties that distinguish z/Architecture from commodity architectures:
- All I/O is performed through the Channel Subsystem (CSS). There is no memory-mapped I/O.
- The Program Status Word (PSW) encodes the instruction address, addressing mode, DAT enable, and all interrupt masks in a single 128-bit register.
- The lowcore at physical address `0x0` is the hardware-defined interrupt vector table with a fixed layout.
- Inter-processor communication uses the SIGP instruction rather than memory-mapped registers or MSIs.
- The STFLE instruction enumerates optional hardware facilities (analogous to CPUID on x86).
2. Program Status Word (PSW)
The PSW is 128 bits wide. It is loaded atomically by LPSWE and saved atomically on every interrupt.
Bits 0–63: Mask word
Bit 1: PER enable
Bit 5: DAT enable
Bit 6: I/O interrupt mask
Bit 7: External interrupt mask
Bits 8–11: PSW key
Bit 13: Machine-check mask
Bit 14: Wait state
Bit 15: Problem state (0=supervisor, 1=problem/user)
Bits 16–17: Address-space control
Bits 18–19: Condition code
Bit 31: EA (Extended Addressing), must be 1 for 64-bit
Bit 32: BA (Basic Addressing), must be 1 for 64-bit
Bits 64–127: Instruction address (64-bit)
EA=1, BA=1 selects 64-bit addressing mode (EA=0, BA=1 is 31-bit; EA=1, BA=0 is invalid). SAM64 sets both bits without altering other PSW fields.
Disabled-wait PSW: All interrupt masks cleared, wait bit set. The CPU halts permanently. Used as the panic state.
New PSWs: For each interrupt class (I/O, external, machine check, program, restart, SVC), the architecture reserves a fixed lowcore offset for a "new PSW" — the PSW loaded when that interrupt fires. The kernel must install valid new PSWs before enabling the corresponding interrupt class.
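The mask word can be built from named bit constants. The following is a minimal sketch, assuming the bit positions given in the Principles of Operation (bits are numbered from the most significant end, the wait bit is bit 14, and 64-bit mode needs EA and BA both set). The names `PSW_BIT`, `psw_t`, and `psw_disabled_wait` are illustrative and are not taken from the ZXFoundation™ source.

```c
#include <stdint.h>

/* IBM numbers bits from the most significant end: bit 0 is the MSB. */
#define PSW_BIT(n)       (1ULL << (63 - (n)))

#define PSW_MASK_DAT     PSW_BIT(5)
#define PSW_MASK_IO      PSW_BIT(6)
#define PSW_MASK_EXT     PSW_BIT(7)
#define PSW_MASK_MCHECK  PSW_BIT(13)
#define PSW_MASK_WAIT    PSW_BIT(14)
#define PSW_MASK_EA      PSW_BIT(31)
#define PSW_MASK_BA      PSW_BIT(32)

/* 128-bit PSW as two doublewords (hypothetical type name). */
typedef struct {
    uint64_t mask;
    uint64_t addr;
} psw_t;

/* Disabled wait: every interrupt mask clear, wait bit set, 64-bit mode.
 * The address field is not executed, so it can carry a panic code. */
static inline psw_t psw_disabled_wait(uint64_t code)
{
    psw_t psw = { PSW_MASK_WAIT | PSW_MASK_EA | PSW_MASK_BA, code };
    return psw;
}
```

A new PSW slot in the lowcore could be filled with `psw_disabled_wait(code)` so that an unexpected interrupt stops the CPU with an identifiable code in the address field.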
3. Lowcore (Prefix Area)
The lowcore (prefix area) is the 8 KB region at real address 0x0. Its layout is fixed by the architecture.
| Offset | Content |
|---|---|
| `0x000` | IPL PSW |
| `0x008` | IPL CCW1 |
| `0x010` | IPL CCW2 |
| `0x0B8` | Subchannel ID of IPL device |
| `0x1A0` | Restart new PSW |
| `0x1B0` | External new PSW |
| `0x1C0` | SVC new PSW |
| `0x1D0` | Program new PSW |
| `0x1E0` | Machine check new PSW |
| `0x1F0` | I/O new PSW |
The prefix register (set by SPX, read by STPX) redirects each CPU's accesses to the lowcore at real address 0x0 to a per-CPU block of absolute storage. Each CPU therefore has its own private lowcore: the BSP uses the block at absolute address 0x0, and APs use separately allocated blocks.
4. Channel Command Words (CCW) and I/O
All device I/O is performed through the Channel Subsystem. The CPU constructs a Channel Program — a linked list of CCWs — and submits it via SSCH (Start Subchannel).
CCW Format-1 (8 bytes)
Bits 0–7: Command code (0x02=Read, 0x01=Write, 0x04=Sense)
Bits 8–15: Flags; bit 9 is Chain Command (CC), which links to the next CCW
Bits 16–31: Byte count
Bits 33–63: Channel Data Address (CDA), the 31-bit physical address of the data buffer (bit 32 must be 0)
Critical constraint: The CDA field is 31 bits. All I/O data buffers must reside below physical address `0x80000000`. `ZONE_DMA` covers `[0, 16 MB)`, well inside this limit, so any buffer allocated from it satisfies the constraint.
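The format-1 layout maps directly onto an 8-byte C struct. The sketch below assumes the architected field order (command, flags, count, data address) and rejects buffers the 31-bit CDA cannot reach. `struct ccw1`, `ccw1_init`, and the constant names are hypothetical and do not come from `dasd_io.c`.

```c
#include <stdbool.h>
#include <stdint.h>

#define CCW_CMD_WRITE  0x01
#define CCW_CMD_READ   0x02
#define CCW_FLAG_CC    0x40   /* chain command: bit 9 of the CCW */
#define CCW_CDA_LIMIT  0x80000000ULL

/* Format-1 CCW: command, flags, byte count, then the data address. */
struct ccw1 {
    uint8_t  cmd;
    uint8_t  flags;
    uint16_t count;
    uint32_t cda;
};

_Static_assert(sizeof(struct ccw1) == 8, "a CCW is exactly 8 bytes");

/* Fill a CCW; refuse any buffer that is not entirely below 2 GB. */
static bool ccw1_init(struct ccw1 *ccw, uint8_t cmd, uint8_t flags,
                      uint64_t buf_pa, uint16_t count)
{
    if (buf_pa >= CCW_CDA_LIMIT || buf_pa + count > CCW_CDA_LIMIT)
        return false;

    ccw->cmd   = cmd;
    ccw->flags = flags;
    ccw->count = count;
    ccw->cda   = (uint32_t)buf_pa;
    return true;
}
```

Checking the address when the CCW is built turns a silent channel program failure into an error the caller can handle before SSCH is issued.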
I/O Sequence
CPU Channel Subsystem
│ │
├─ SSCH (schid, ORB) ────────► │ Submit channel program
│ ├─ Execute CCW chain, transfer data
│◄──────── I/O interrupt ──────┤ Subchannel status available
├─ TSCH (schid, IRB) ────────► │ Read Interrupt Response Block
│◄──────── IRB ────────────────┤ Device status, residual count
5. Initial Program Load (IPL)
When the operator issues a LOAD command, the channel subsystem performs the following automatically:
- Reads the first physical record from the IPL device (ECKD: C=0, H=0, R=1) into physical address `0x0`.
- The record contains an IPL PSW at `0x0` and two CCWs at `0x8`/`0x10`.
- The CSS executes the CCW chain to load additional data.
- The CPU loads the IPL PSW and begins execution.
For ZXFL, the IPL PSW is a 31-bit ESA/390 PSW pointing to the Stage 0 entry. The first instruction switches to z/Architecture mode via SIGP SET ARCHITECTURE.
6. Dynamic Address Translation (DAT)
DAT is enabled by PSW bit 5. When on, every memory access is translated through the page table hierarchy rooted at the ASCE in CR1.
Address Space Control Element (ASCE)
The ASCE is a 64-bit value in CR1 encoding the physical address of the root table, the Designation Type (DT), and the Table Length (TL). ZXFoundation™ uses DT=11 (binary; region-first table), selecting 5-level paging.
5-Level Page Table Hierarchy
| Level | Name | Entries | Coverage per entry |
|---|---|---|---|
| ASCE → | R1 (Region-First) | 2048 | 8 PB |
| R1 → | R2 (Region-Second) | 2048 | 4 TB |
| R2 → | R3 (Region-Third) | 2048 | 2 GB |
| R3 → | Segment Table | 2048 | 1 MB |
| Seg → | Page Table | 256 | 4 KB |
Each region and segment table is 16 KB (2048 × 8 bytes). Each page table is 2 KB (256 × 8 bytes).
Virtual Address Decomposition (DT=11)
63 53 52 42 41 31 30 20 19 12 11 0
┌────────┬──────────┬──────────┬──────────┬────────┬──────────┐
│ RFX │ RSX │ RTX │ SX │ PX │ BX │
│ 11 bit │ 11 bit │ 11 bit │ 11 bit │ 8 bit │ 12 bit │
└────────┴──────────┴──────────┴──────────┴────────┴──────────┘
R1 idx R2 idx R3 idx Seg idx PT idx Byte offset
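The decomposition above is plain shifting and masking. A minimal sketch follows, using the same least-significant-bit numbering as the diagram; `struct va_index` and `va_split` are illustrative names, not part of the kernel API.

```c
#include <stdint.h>

/* Table indices and byte offset extracted from a 64-bit virtual address. */
struct va_index {
    unsigned rfx;   /* region-first index,  bits 63..53 */
    unsigned rsx;   /* region-second index, bits 52..42 */
    unsigned rtx;   /* region-third index,  bits 41..31 */
    unsigned sx;    /* segment index,       bits 30..20 */
    unsigned px;    /* page index,          bits 19..12 */
    unsigned bx;    /* byte offset,         bits 11..0  */
};

static struct va_index va_split(uint64_t va)
{
    struct va_index ix;

    ix.rfx = (unsigned)((va >> 53) & 0x7FF);
    ix.rsx = (unsigned)((va >> 42) & 0x7FF);
    ix.rtx = (unsigned)((va >> 31) & 0x7FF);
    ix.sx  = (unsigned)((va >> 20) & 0x7FF);
    ix.px  = (unsigned)((va >> 12) & 0xFF);
    ix.bx  = (unsigned)(va & 0xFFF);
    return ix;
}
```

For the address `0xFFFF800000000000`, this yields RFX `0x7FF` and RSX `0x7E0` with all lower indices zero.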
Large Pages (EDAT)
| Facility | STFLE bit | Page size | Mechanism |
|---|---|---|---|
| EDAT-1 | 8 | 1 MB | FC=1 in Segment Table Entry |
| EDAT-2 | 78 | 2 GB | FC=1 in Region-Third Entry |
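Testing a facility bit needs care because STFLE numbers bits from the most significant end of each doubleword. The sketch below assumes the facility list is held as an array of 64-bit words exactly as STFLE stored them on the big-endian target. `stfle_has` and the `STFLE_BIT_*` names are hypothetical helpers.

```c
#include <stdbool.h>
#include <stdint.h>

#define STFLE_BIT_EDAT1   8
#define STFLE_BIT_EDAT2  78

/* Facility bit n lives in doubleword n / 64, counted from the MSB. */
static bool stfle_has(const uint64_t *fac, unsigned nwords, unsigned bit)
{
    unsigned word = bit / 64;

    if (word >= nwords)
        return false;   /* bit lies beyond the stored list */
    return (fac[word] >> (63 - bit % 64)) & 1;
}
```

A page table builder could call `stfle_has(fac, nwords, STFLE_BIT_EDAT1)` before setting FC=1 in a segment table entry.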
7. Virtual Address Space Layout
0x0000000000000000 User space (future)
...
0x00007FFFFFFFFFFF User space top
[ unmapped — translation exception ]
0xFFFF800000000000 HHDM base (CONFIG_KERNEL_VIRT_OFFSET)
Physical memory linearly mapped here.
PA 0x0 → VA 0xFFFF800000000000
0xFFFFC00000000000 vmalloc / ioremap region
0xFFFFFFFFFFFFFFFF Top of address space
The HHDM offset is 0xFFFF800000000000. The bootloader builds this mapping before transferring control; all kernel pointers in the boot protocol are HHDM virtual addresses.
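Because the HHDM is a fixed linear offset, converting between physical and virtual addresses is a single addition or subtraction. A minimal sketch, in which the helper names are assumptions and only the offset value comes from this guide:

```c
#include <stdbool.h>
#include <stdint.h>

#define CONFIG_KERNEL_VIRT_OFFSET 0xFFFF800000000000ULL

/* Physical address to its HHDM alias. */
static inline uint64_t hhdm_phys_to_virt(uint64_t pa)
{
    return pa + CONFIG_KERNEL_VIRT_OFFSET;
}

/* HHDM alias back to the physical address. */
static inline uint64_t hhdm_virt_to_phys(uint64_t va)
{
    return va - CONFIG_KERNEL_VIRT_OFFSET;
}

/* True if va falls inside the linear map of ram_bytes of memory. */
static inline bool hhdm_contains(uint64_t va, uint64_t ram_bytes)
{
    return va >= CONFIG_KERNEL_VIRT_OFFSET &&
           va - CONFIG_KERNEL_VIRT_OFFSET < ram_bytes;
}
```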
8. Physical Memory Zones
| Zone | Range | Purpose |
|---|---|---|
| `ZONE_DMA` | [0, 16 MB) | Channel I/O buffers (31-bit CDA constraint) |
| `ZONE_NORMAL` | [16 MB, RAM limit) | General kernel allocations |
9. Control Registers
| Register | Purpose |
|---|---|
| CR0 | I/O/external interrupt subclass masks, feature enables |
| CR1 | Primary ASCE (page table root) |
| CR6 | I/O interrupt subclass mask (extended) |
| CR14 | Machine check interrupt mask |
The bootloader saves CR0, CR1, and CR14 snapshots in the boot protocol so the kernel can inspect the handover state.
Bootloader Overview
1. What Is ZXFL?
ZXFL (ZXFoundation™ Loader) is the two-stage bootloader for ZXFoundation™. It is the only supported mechanism for loading the kernel nucleus. Its responsibilities are:
- Transition the CPU from ESA/390 to z/Architecture 64-bit mode.
- Locate and load the kernel ELF64 image from DASD.
- Verify kernel integrity (ZXVL structural lock, handshake, SHA-256 checksums).
- Probe hardware: memory, CPUs, TOD clock, system identification.
- Build the 5-level page tables (identity map + HHDM).
- Populate the boot protocol structure.
- Transfer control to the kernel entry point with DAT enabled.
2. Two-Stage Design
The split is imposed by a hard architectural constraint: the IPL mechanism loads exactly one record from the IPL device into physical address 0x0 and executes it. That record must contain the IPL PSW and enough code to load a larger second stage.
| Stage | Internal name | Dataset | Load address | Size limit |
|---|---|---|---|---|
| 0 | zxfl_stage1 | CORE.ZXFOUNDATIONLOADER00.SYS | 0x0 | 12 KB |
| 1 | zxfl_stage2 | CORE.ZXFOUNDATIONLOADER01.SYS | 0x20000 | ~512 KB |
Stage 0 is a minimal DASD reader. Its only job is to find Stage 1 in the VTOC, load it to 0x20000, and jump to it.
Stage 1 is the full loader. It performs all hardware detection, ELF loading, integrity verification, page table construction, and the final jump to the kernel.
3. IPL Flow
Power-on / LOAD button
│
▼
Channel subsystem reads IPL record (C=0, H=0, R=1) → 0x0
│
▼
Stage 0 (arch/s390x/init/zxfl/stage1/)
├─ SIGP SET ARCHITECTURE → z/Architecture mode
├─ SAM64 → 64-bit addressing
├─ Clear BSS
├─ Find CORE.ZXFOUNDATIONLOADER01.SYS in VTOC
├─ Read it to 0x20000
└─ Jump to 0x20000
│
▼
Stage 1 (arch/s390x/init/zxfl/stage2/)
├─ Install disabled-wait new PSWs (lowcore)
├─ Clear BSS (MVCL)
├─ STFLE — detect facilities
├─ Probe IPL device (ECKD / FBA Sense ID + RDC)
├─ Read parmfile (ETC.ZXFOUNDATION.PARM)
├─ Find CORE.ZXFOUNDATION.NUCLEUS in VTOC
├─ Load ELF64 PT_LOAD segments to physical memory
├─ ZXVL: structural lock + handshake + SHA-256 checksums
├─ Probe memory (write-pattern test)
├─ Load sysmodule= modules
├─ Detect SMP (SIGP Sense), STSI, TOD (STCK)
├─ Build 5-level page tables (identity + HHDM)
├─ Translate all protocol pointers to HHDM virtual
└─ LPSWE → kernel entry point (DAT on, interrupts masked)
4. Dataset Names
All datasets reside on the IPL DASD volume. Names follow the IBM MVS convention (uppercase, dot-separated, max 44 characters).
| Dataset | Contents |
|---|---|
| `CORE.ZXFOUNDATIONLOADER00.SYS` | Stage 0 IPL record |
| `CORE.ZXFOUNDATIONLOADER01.SYS` | Stage 1 flat binary |
| `CORE.ZXFOUNDATION.NUCLEUS` | Kernel ELF64 |
| `ETC.ZXFOUNDATION.PARM` | Boot parameters (parmfile) |
Additional datasets may be listed in the parmfile via sysmodule= entries.
5. Parmfile
The parmfile ETC.ZXFOUNDATION.PARM is a plain-text file read by Stage 1. Supported keys:
| Key | Description | Default |
|---|---|---|
| `syssize=` | Memory probe limit in MB | 512 |
| `sysmodule=` | Dataset name of an additional module to load | (none) |
Multiple sysmodule= lines are permitted (up to 16).
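The two keys can be parsed with a simple line scanner. The sketch below assumes one `key=value` pair per line in a NUL-terminated ASCII buffer, which this guide does not specify. It also uses hosted C library calls for brevity, whereas the freestanding loader would need its own equivalents. `struct parm_config` and `parm_parse` are hypothetical names, not the interface of `parmfile.c`.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PARM_MAX_MODULES         16
#define PARM_NAME_MAX            45   /* 44-character dataset name + NUL */
#define PARM_DEFAULT_SYSSIZE_MB  512

struct parm_config {
    uint32_t syssize_mb;
    unsigned module_count;
    char     modules[PARM_MAX_MODULES][PARM_NAME_MAX];
};

/* Parse a parmfile image; unknown keys and extra modules are ignored. */
static void parm_parse(const char *text, struct parm_config *cfg)
{
    cfg->syssize_mb   = PARM_DEFAULT_SYSSIZE_MB;
    cfg->module_count = 0;

    while (*text) {
        size_t len = strcspn(text, "\n");   /* length of this line */

        if (len > 8 && strncmp(text, "syssize=", 8) == 0) {
            cfg->syssize_mb = (uint32_t)strtoul(text + 8, NULL, 10);
        } else if (len > 10 && strncmp(text, "sysmodule=", 10) == 0 &&
                   cfg->module_count < PARM_MAX_MODULES) {
            size_t n = len - 10;
            if (n >= PARM_NAME_MAX)
                n = PARM_NAME_MAX - 1;      /* truncate over-long names */
            memcpy(cfg->modules[cfg->module_count], text + 10, n);
            cfg->modules[cfg->module_count][n] = '\0';
            cfg->module_count++;
        }

        text += len;
        if (*text == '\n')
            text++;
    }
}
```

An empty parmfile leaves `syssize_mb` at the documented default of 512 and the module list empty.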
6. Constraints
- All CCW channel data addresses must be 31-bit (< `0x80000000`). Static BSS buffers satisfy this automatically.
- Stage 0 must fit within 12 KB (enforced by `ASSERT` in `stage1.ld`).
- The Stage 1 stack is 32 KB. The kernel must switch to its own stack before consuming more than ~8 KB.
- The kernel entry point must be ≥ `0xFFFF800000040000` (HHDM + 256 KB). The loader enforces this.
Stage 0
Source: arch/s390x/init/zxfl/stage1/
1. Purpose
Stage 0 is the minimal IPL loader. It occupies the first record on the IPL DASD volume and is loaded by the channel subsystem into physical address 0x0. Its sole responsibility is to locate Stage 1 (CORE.ZXFOUNDATIONLOADER01.SYS) in the VTOC, read it to 0x20000, and jump to it.
2. Entry Point (head.S)
The channel subsystem loads the IPL record and executes the PSW at offset 0x0. This PSW is a 31-bit ESA/390 PSW pointing to stage1_entry.
The entry sequence:
stage1_entry:
1. SIGP SET ARCHITECTURE (order 0x12) → switch to z/Architecture
Retry with "restore PSWs" flag if first attempt fails.
2. SAM64 → enable 64-bit addressing mode
3. Clear BSS (byte loop — MVCL is unsafe before architecture switch)
4. Set stack pointer to stage1_stack_top − 160
5. Load schid from lowcore offset 0xB8
6. Call zxfl00_entry(schid)
7. Disabled-wait PSW (fallback — zxfl00_entry is [[noreturn]])
The 160-byte stack offset is the standard z/Architecture register save area size.
3. Main Function (entry.c — zxfl00_entry)
Execution order:
- `diag_setup()`: flush any partial DIAG 8 output line.
- Print the Stage 0 banner via DIAG 8.
- `dasd_find_dataset(schid, "CORE.ZXFOUNDATIONLOADER01.SYS", &ext)`: locate Stage 1 in the VTOC.
- Read the dataset track-by-track into `0x20000` using `dasd_read_next`.
- Sanity-check: verify the loaded image is not a disabled-wait PSW.
- Jump to `0x20000` with `schid` in `%r2`.
4. Linker Script (stage1.ld)
| Section | Address | Notes |
|---|---|---|
| `.text.ipl` | 0x0 | IPL PSW (8 bytes) |
| `.text` | 0x58 | Code (after lowcore reserved area) |
| `.bss` | after .text | Zero-initialized data |
An ASSERT in the linker script enforces that the entire stage fits within 12 KB. The build will fail if this limit is exceeded.
5. Stack
An 8 KB static array in BSS. The stack pointer is initialized to stage1_stack_top − 160.
6. Shared Library (common/)
Stage 0 uses a subset of the shared common/ library:
| Module | Purpose |
|---|---|
| `dasd_io.c` | Low-level CCW I/O (SSCH/TSCH) |
| `dasd_vtoc.c` | VTOC traversal and dataset lookup |
| `diag.c` | DIAG 8 console output |
| `ebcdic.c` | EBCDIC ↔ ASCII conversion |
| `panic.c` | Disabled-wait on fatal error |
| `string.c` | Minimal memcpy, memset, strcmp |
Stage 1
Source: arch/s390x/init/zxfl/stage2/
1. Purpose
Stage 1 is the full production loader. It is a flat binary linked at 0x20000, loaded there by Stage 0. It performs all hardware detection, kernel loading, integrity verification, page table construction, and the final transfer of control to the kernel.
2. Entry Point (entry.S — stage2_entry)
stage2_entry:
1. Save schid from %r2 into a callee-saved register (%r13)
2. Call zxfl_lowcore_setup() — install disabled-wait new PSWs
3. SSM 0x00 — mask all interrupts off
4. Clear BSS with MVCL (pad-fill mode, source length = 0)
5. Set stack pointer to stage2_stack_top − 160
6. Restore schid into %r2
7. Call zxfl01_entry(schid)
SSM 0x00 is issued immediately after zxfl_lowcore_setup installs safe new PSWs. Any interrupt that fires during the loader will hit a known disabled-wait rather than garbage.
BSS is cleared with MVCL in pad-fill mode (source length = 0, pad byte = 0x00). This is safe in 64-bit mode and faster than a byte loop for large BSS sections.
3. Main Function (entry.c — zxfl01_entry)
Execution order:
| Step | Action |
|---|---|
| 1 | STFLE — store facility list into proto.stfle_fac[] |
| 2 | CR setup — clear I/O, external, machine-check masks in CR0; zero CR6 and CR14 |
| 3 | Device probe — probe_ipl_device(): ECKD Sense ID first, then FBA; populates ipl_dev_type and ipl_dev_model |
| 4 | Parmfile — read ETC.ZXFOUNDATION.PARM; parse syssize= |
| 5 | Nucleus load — dasd_find_dataset_extents + zxfl_load_elf64 |
| 6 | ZXVL — structural lock check, handshake, SHA-256 segment checksums |
| 7 | Memory probe — write-pattern test at 1 MB granularity up to syssize or 512 MB |
| 8 | Module loading — load each sysmodule= dataset as a flat binary after the kernel image |
| 9 | System detection — zxfl_system_detect: STSI (manufacturer, model, LPAR), SIGP Sense (CPU map), STCK (TOD) |
| 10 | Protocol finalization — magic, version, binding token, stack canaries, CR snapshots |
| 11 | MMU + jump — zxfl_mmu_setup_and_jump: build page tables, translate pointers, LPSWE to kernel entry |
4. Linker Script (stage2.ld)
The binary is linked at 0x20000 as a flat ELF. The post-build step strips it to a raw binary with objcopy -O binary.
5. Stack
A 32 KB static array in BSS. The kernel receives a pointer to the top of this stack in %r15 and in proto->kernel_stack_top (HHDM virtual). The kernel must switch to its own stack before consuming more than ~8 KB.
6. Shared Library (common/)
Stage 1 uses the full common/ library:
| Module | Purpose |
|---|---|
| `dasd_io.c` | Low-level CCW I/O |
| `dasd_vtoc.c` | VTOC traversal |
| `dasd_eckd.c` | ECKD device driver |
| `dasd_fba.c` | FBA device driver |
| `dasd_tape.c` | Tape device driver |
| `elfload.c` | ELF64 segment loader |
| `mmu.c` | Bootloader page table builder |
| `lowcore.c` | Lowcore / new PSW setup |
| `zxvl_verify.c` | ZXVL integrity checks |
| `parmfile.c` | Parmfile parser |
| `stfle.c` | STFLE facility detection |
| `system.c` | STSI, SIGP Sense, STCK |
| `diag.c`, `ebcdic.c`, `panic.c`, `string.c` | Utilities |
DASD Subsystem
Source: arch/s390x/init/zxfl/common/dasd_*.c
1. Overview
ZXFL supports three DASD device types. The correct driver is selected automatically by probing the IPL device with Sense ID and Read Device Characteristics (RDC) CCWs.
| Type | Driver | Typical device |
|---|---|---|
| ECKD | dasd_eckd.c | 3390 (most common) |
| FBA | dasd_fba.c | 9336 |
| Tape | dasd_tape.c | 3480, 3490, 3590 |
2. Low-Level I/O (dasd_io.c)
All device access goes through a single CCW submission layer:
dasd_do_io(schid, ccw_chain, sense_buf)
│
├─ Build ORB pointing to ccw_chain
├─ SSCH(schid, ORB)
├─ Wait for completion (poll TSCH with interrupts disabled)
├─ TSCH(schid, IRB) → check device end status
└─ Return status or panic on unrecoverable error
All CCW data buffers are static BSS arrays, ensuring they reside below 0x80000000 (31-bit CDA constraint).
3. ECKD Driver (dasd_eckd.c)
ECKD (Extended Count Key Data) is the standard format for IBM 3390 DASD. Addressing is by cylinder, head, and record number (C/H/R).
Key operations:
| Operation | CCW command | Description |
|---|---|---|
| Sense ID | 0xE4 | Identify device type and model |
| Read Device Characteristics | 0x64 | Obtain geometry (cylinders, heads, sectors) |
| Seek | 0x07 | Position to cylinder/head |
| Search ID Equal | 0x31 | Find record by C/H/R |
| Read Data (multi-track) | 0x86 | Read the data area of a record |
Track reads use a Seek → Search → Read CCW chain. The search CCW loops (via TIC — Transfer in Channel) until the target record is found.
4. FBA Driver (dasd_fba.c)
FBA (Fixed Block Architecture) devices use linear block addressing. Each block is 512 bytes.
Key operations:
| Operation | CCW command | Description |
|---|---|---|
| Sense ID | 0xE4 | Identify device |
| Define Extent | 0x63 | Set the block range for the following operation |
| Locate Record | 0x43 | Specify starting block and count |
| Read | 0x42 | Transfer data |
5. Tape Driver (dasd_tape.c)
Tape support is provided for environments where the kernel is stored on a 3480/3490/3590 tape cartridge. Tape is read sequentially; there is no random access.
Key operations: Sense ID, Rewind, Read Block, Forward Space File.
6. Device Selection
At Stage 1 startup, probe_ipl_device() issues a Sense ID CCW to the IPL subchannel. The returned device type code selects the driver:
device_type == 0x3390 → ECKD
device_type == 0x9336 → FBA
device_type == 0x3480, 0x3490, or 0x3590 → Tape
otherwise → panic("unsupported IPL device")
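The selection table above maps naturally onto a `switch`. In this sketch the enum and function names are illustrative; an unsupported type is returned as `IPL_DRIVER_NONE` so that the caller decides when to panic.

```c
#include <stdint.h>

enum ipl_driver {
    IPL_DRIVER_NONE,   /* unsupported device: caller should panic */
    IPL_DRIVER_ECKD,
    IPL_DRIVER_FBA,
    IPL_DRIVER_TAPE
};

/* Map the Sense ID device type code to a driver. */
static enum ipl_driver ipl_select_driver(uint16_t dev_type)
{
    switch (dev_type) {
    case 0x3390:
        return IPL_DRIVER_ECKD;
    case 0x9336:
        return IPL_DRIVER_FBA;
    case 0x3480:
    case 0x3490:
    case 0x3590:
        return IPL_DRIVER_TAPE;
    default:
        return IPL_DRIVER_NONE;
    }
}
```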
VTOC
Source: arch/s390x/init/zxfl/common/dasd_vtoc.c
1. What Is the VTOC?
The Volume Table of Contents (VTOC) is the directory of a z/Architecture DASD volume. It is an IBM-defined on-disk structure that maps dataset names to their physical extents (cylinder/head ranges on ECKD, or block ranges on FBA).
The VTOC begins at a fixed location recorded in the DASD label (Format-4 DSCB at cylinder 0, head 0, record 3 on ECKD). ZXFL reads the VTOC to locate the kernel and loader datasets by name.
2. DSCB Types
The VTOC consists of Data Set Control Blocks (DSCBs), each 140 bytes. ZXFL uses three types:
| Type | Format | Purpose |
|---|---|---|
| Format-1 | F1DSCB | Dataset name, creation date, first extent |
| Format-3 | F3DSCB | Additional extents (overflow from F1) |
| Format-4 | F4DSCB | VTOC descriptor — location and size of VTOC itself |
3. Dataset Lookup
dasd_find_dataset(schid, name, &ext)
│
├─ Read F4DSCB (C=0, H=0, R=3) → get VTOC start C/H and size
├─ For each DSCB in VTOC:
│ ├─ Read record
│ ├─ Check format byte
│ ├─ If F1DSCB: compare DS1DSNAM (44-byte EBCDIC name) to target
│ └─ If match: extract extent list from DS1EXT1..DS1EXT3
└─ Return first extent (cylinder/head start + end)
Dataset names are stored in EBCDIC on disk. ZXFL converts the search name from ASCII to EBCDIC before comparison using ebcdic_ascii_to_ebcdic().
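The name comparison can be sketched as follows. This assumes the usual DSCB convention that the 44-byte name field is left-justified and padded with EBCDIC blanks (`0x40`), and it converts only the characters ZXFL dataset names use (uppercase letters, digits, and the period). The function names are hypothetical and do not reflect the actual signature of `ebcdic_ascii_to_ebcdic()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DSNAME_LEN 44

/* ASCII to EBCDIC for the dataset-name character subset only. */
static uint8_t dsn_ascii_to_ebcdic(char c)
{
    if (c >= 'A' && c <= 'I') return (uint8_t)(0xC1 + (c - 'A'));
    if (c >= 'J' && c <= 'R') return (uint8_t)(0xD1 + (c - 'J'));
    if (c >= 'S' && c <= 'Z') return (uint8_t)(0xE2 + (c - 'S'));
    if (c >= '0' && c <= '9') return (uint8_t)(0xF0 + (c - '0'));
    if (c == '.')             return 0x4B;
    return 0x40;              /* anything else becomes a blank */
}

/* Build the 44-byte blank-padded EBCDIC search key. */
static void dsname_to_key(const char *name, uint8_t key[DSNAME_LEN])
{
    size_t n = strlen(name);

    if (n > DSNAME_LEN)
        n = DSNAME_LEN;
    memset(key, 0x40, DSNAME_LEN);
    for (size_t i = 0; i < n; i++)
        key[i] = dsn_ascii_to_ebcdic(name[i]);
}

/* Compare an on-disk name field against an ASCII name. */
static bool dsname_matches(const uint8_t ds1dsnam[DSNAME_LEN],
                           const char *name)
{
    uint8_t key[DSNAME_LEN];

    dsname_to_key(name, key);
    return memcmp(ds1dsnam, key, DSNAME_LEN) == 0;
}
```

Converting the search name once and comparing raw bytes avoids translating every DSCB read from disk.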
4. Extent Structure
Each extent describes a contiguous range of tracks:
#include <stdint.h>

struct extent {
    uint16_t cyl_start;   // starting cylinder
    uint16_t head_start;  // starting head
    uint16_t cyl_end;     // ending cylinder (inclusive)
    uint16_t head_end;    // ending head (inclusive)
};
A dataset may span up to three extents in its F1DSCB, with additional extents in a chained F3DSCB. ZXFL follows the F3 chain if the dataset requires more than three extents.
5. Sequential Read
After locating a dataset's extents, dasd_read_next() reads tracks sequentially:
for each extent:
for each track in [cyl_start/head_start .. cyl_end/head_end]:
Seek → Search R=1 → Read all records on track → append to buffer
The read stops when the buffer is full or all extents are exhausted.
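Walking an extent means stepping heads within a cylinder and then moving to the next cylinder. The sketch below assumes the heads-per-cylinder value comes from Read Device Characteristics (15 on a 3390). `extent_track_count` and `track_next` are illustrative helpers, not the internals of `dasd_read_next()`.

```c
#include <stdint.h>

struct extent {
    uint16_t cyl_start;
    uint16_t head_start;
    uint16_t cyl_end;    /* inclusive */
    uint16_t head_end;   /* inclusive */
};

/* Number of tracks covered by an extent, both ends inclusive. */
static uint32_t extent_track_count(const struct extent *e,
                                   uint16_t heads_per_cyl)
{
    uint32_t first = (uint32_t)e->cyl_start * heads_per_cyl + e->head_start;
    uint32_t last  = (uint32_t)e->cyl_end * heads_per_cyl + e->head_end;

    return last - first + 1;
}

/* Advance (cyl, head) to the next track, wrapping into the next cylinder. */
static void track_next(uint16_t *cyl, uint16_t *head, uint16_t heads_per_cyl)
{
    if (++*head == heads_per_cyl) {
        *head = 0;
        ++*cyl;
    }
}
```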
ELF64 Loader
Source: arch/s390x/init/zxfl/common/elfload.c
1. Overview
zxfl_load_elf64 loads the kernel ELF64 image from DASD into physical memory. It processes only PT_LOAD program headers; all other segment types are ignored.
2. Load Sequence
zxfl_load_elf64(schid, dataset_name, load_base_out)
│
├─ Read ELF header (first 64 bytes)
├─ Validate: magic 0x7F 'E' 'L' 'F', EI_CLASS=2 (64-bit),
│ EI_DATA=2 (big-endian), e_machine=0x16 (s390)
├─ Read program header table (e_phoff, e_phnum entries)
├─ For each PT_LOAD segment:
│ ├─ Compute physical load address:
│ │ pa = p_paddr − CONFIG_KERNEL_VIRT_OFFSET
│ ├─ Read p_filesz bytes from file offset p_offset → pa
│ └─ Zero-fill [pa + p_filesz, pa + p_memsz)
└─ Return load_min (lowest p_paddr seen, stripped of HHDM offset)
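The header validation step can be sketched against the raw 64-byte header. The field offsets used here (`e_type` at 16, `e_machine` at 18) are those of the ELF64 format. The constant and function names are local to this example and are not taken from `elfload.c`.

```c
#include <stdbool.h>
#include <stdint.h>

#define ELF_HDR_SIZE      64
#define ELF_CLASS_64      2
#define ELF_DATA_MSB      2      /* big-endian */
#define ELF_TYPE_EXEC     2      /* ET_EXEC */
#define ELF_MACHINE_S390  0x16   /* EM_S390 */

/* Read a big-endian 16-bit field independent of host byte order. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Check magic, class, data encoding, file type, and machine. */
static bool elf64_header_ok(const uint8_t hdr[ELF_HDR_SIZE])
{
    if (hdr[0] != 0x7F || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
        return false;
    if (hdr[4] != ELF_CLASS_64 || hdr[5] != ELF_DATA_MSB)
        return false;
    if (be16(hdr + 16) != ELF_TYPE_EXEC)
        return false;
    return be16(hdr + 18) == ELF_MACHINE_S390;
}
```

Reading multi-byte fields byte by byte keeps the check correct when the same code runs in a host-side tool on a little-endian machine.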
3. Address Computation
The kernel is linked with virtual addresses in the HHDM range (p_vaddr ≥ 0xFFFF800000000000). The physical load address is derived by subtracting CONFIG_KERNEL_VIRT_OFFSET:
$$\texttt{pa} = \texttt{p_paddr} - \texttt{CONFIG_KERNEL_VIRT_OFFSET}$$
The loader does not use p_vaddr directly; it uses p_paddr to avoid ambiguity when the linker script sets AT() addresses.
4. Constraints
- The kernel ELF must be `ET_EXEC` (executable, not shared object).
- `e_machine` must be `0x16` (`EM_S390`). Any other value causes an immediate panic.
- All `PT_LOAD` segments must have `p_paddr ≥ CONFIG_KERNEL_VIRT_OFFSET`. A segment below the HHDM offset is rejected.
- The kernel entry point (`e_entry`) must be ≥ `0xFFFF800000040000` (HHDM + 256 KB). The loader enforces this before the final jump.
- The total loaded image (all PT_LOAD segments) must fit within the memory probed by the write-pattern test.
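The address-related constraints above reduce to a few comparisons per segment. A minimal sketch follows, in which `segment_phys_addr` and `entry_point_ok` are hypothetical names and `ram_bytes` stands for the size found by the memory probe:

```c
#include <stdbool.h>
#include <stdint.h>

#define CONFIG_KERNEL_VIRT_OFFSET  0xFFFF800000000000ULL
#define KERNEL_ENTRY_MIN           (CONFIG_KERNEL_VIRT_OFFSET + 0x40000ULL)

/* Derive the physical load address of a PT_LOAD segment.
 * Fails if p_paddr is below the HHDM base or the segment would
 * extend past the probed memory. */
static bool segment_phys_addr(uint64_t p_paddr, uint64_t p_memsz,
                              uint64_t ram_bytes, uint64_t *pa_out)
{
    if (p_paddr < CONFIG_KERNEL_VIRT_OFFSET)
        return false;

    uint64_t pa = p_paddr - CONFIG_KERNEL_VIRT_OFFSET;

    if (pa > ram_bytes || p_memsz > ram_bytes - pa)
        return false;

    *pa_out = pa;
    return true;
}

/* The entry point must lie at or above HHDM + 256 KB. */
static bool entry_point_ok(uint64_t e_entry)
{
    return e_entry >= KERNEL_ENTRY_MIN;
}
```

The bounds test is written as `p_memsz > ram_bytes - pa` rather than `pa + p_memsz > ram_bytes` so that a hostile `p_memsz` cannot wrap the sum.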
5. BSS Zeroing
Segments where p_memsz > p_filesz have a BSS tail. The loader zeros this region with memset immediately after reading the file data. This ensures the kernel's BSS is clean before any ZXVL verification.
Bootloader MMU & HHDM
Source: arch/s390x/init/zxfl/common/mmu.c
1. Purpose
Before transferring control to the kernel, Stage 1 must enable DAT (Dynamic Address Translation) and establish the virtual address space the kernel expects. This involves building a 5-level page table hierarchy with two mappings:
| Mapping | Virtual range | Physical range | Purpose |
|---|---|---|---|
| Identity | [0x0, RAM) | [0x0, RAM) | Allows the loader itself to continue executing after DAT is enabled |
| HHDM | [HHDM_BASE, HHDM_BASE + RAM) | [0x0, RAM) | The kernel's primary view of physical memory |
HHDM_BASE = 0xFFFF800000000000 (CONFIG_KERNEL_VIRT_OFFSET).
2. Page Table Allocation
The bootloader allocates page tables from a bump allocator backed by a contiguous physical region after the kernel image. The region base is the first 1 MB-aligned address after kernel_phys_end, with a minimum of 32 MB. The end of this region is recorded in proto->pgtbl_pool_end.
The kernel PMM must mark [pool_base, pgtbl_pool_end) as reserved during initialization.
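Since the protocol records only the end of the pool, the kernel has to recompute the base by the same rule. The sketch below assumes `kernel_phys_end` is exclusive, so an end address that is already 1 MB-aligned is itself a valid base. `pgtbl_pool_base` is a hypothetical helper name.

```c
#include <stdint.h>

#define MB(x)  ((uint64_t)(x) << 20)

/* Round kernel_phys_end up to 1 MB, then apply the 32 MB minimum. */
static uint64_t pgtbl_pool_base(uint64_t kernel_phys_end)
{
    uint64_t base = (kernel_phys_end + MB(1) - 1) & ~(MB(1) - 1);

    return base < MB(32) ? MB(32) : base;
}
```

The PMM would then reserve the range from `pgtbl_pool_base(proto->kernel_phys_end)` up to `proto->pgtbl_pool_end`.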
3. Build Sequence
zxfl_mmu_setup_and_jump(proto, entry_point)
│
├─ Allocate R1 table (16 KB, zero-filled)
├─ For each 4 KB page in [0, RAM):
│ ├─ Map VA = PA (identity)
│ └─ Map VA = PA + HHDM (HHDM)
├─ Build ASCE: R1_phys | DT=11 (binary) | TL=3 (2048 entries)
├─ Load ASCE into CR1 (LCTLG)
├─ Translate all proto pointer fields to HHDM virtual
├─ Set PSW.DAT = 1 in the new PSW
└─ LPSWE → entry_point (DAT on, interrupts masked)
Large pages (EDAT-1 / EDAT-2) are used if the corresponding STFLE facility is present, reducing the number of page table entries required.
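The ASCE value itself is the table origin combined with two small fields in the low byte. This sketch assumes the architected encoding: DT occupies ASCE bits 60–61, TL occupies bits 62–63, and TL counts 512-entry units minus one, so a full 2048-entry table is TL=3. The macro and function names are illustrative.

```c
#include <stdint.h>

#define ASCE_ORIGIN_MASK  0xFFFFFFFFFFFFF000ULL  /* 4 KB-aligned table origin */
#define ASCE_DT_REGION1   0x0CULL                /* DT = 11 binary */
#define ASCE_TL_FULL      0x03ULL                /* TL = 3: 2048 entries */

/* Build a region-first ASCE for a full-length R1 table. */
static uint64_t asce_make_region1(uint64_t r1_phys)
{
    return (r1_phys & ASCE_ORIGIN_MASK) | ASCE_DT_REGION1 | ASCE_TL_FULL;
}
```

An R1 table at physical `0x2000000` therefore produces the ASCE `0x200000F`.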
4. Pointer Translation
All pointer fields in zxfl_boot_protocol_t that reference physical memory are translated to HHDM virtual addresses before the jump:
$$va = pa + \texttt{CONFIG_KERNEL_VIRT_OFFSET}$$
This includes mem_map_addr, kernel_entry, kernel_stack_top, cmdline_addr, and lowcore_phys. The kernel must not attempt to dereference any protocol pointer as a physical address.
5. State at Kernel Entry
| Resource | State |
|---|---|
| DAT | On — CR1 holds the ASCE built by the loader |
| Interrupts | Masked — all interrupt classes disabled |
| `%r2` | HHDM virtual address of zxfl_boot_protocol_t |
| `%r15` | HHDM virtual address of initial stack top (32 KB) |
| All other GPRs | Undefined |
Boot Protocol
Protocol version: ZXFL_VERSION_4 (0x00000004)
1. Overview
The kernel receives a pointer to zxfl_boot_protocol_t in %r2 at entry. All pointer fields are HHDM virtual addresses. The struct is version 4.
The kernel must validate proto->magic == ZXFL_MAGIC (0x5A58464C, "ZXFL") before using any other field. A mismatch indicates the wrong value is in %r2 or the loader did not complete correctly.
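A kernel-side entry check might look like the sketch below. The struct shows only the header fields listed in this chapter, in table order. The real layout, padding, and field offsets must come from the ZXFL protocol header, so `struct zxfl_proto_header` and `zxfl_proto_header_ok` are stand-ins for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define ZXFL_MAGIC      0x5A58464CU   /* "ZXFL" */
#define ZXFL_VERSION_4  0x00000004U

/* Leading protocol fields only (hypothetical layout). */
struct zxfl_proto_header {
    uint32_t magic;
    uint32_t version;
    uint32_t flags;
    uint64_t binding_token;
};

/* Validate magic first, then the version, before touching other fields. */
static bool zxfl_proto_header_ok(const struct zxfl_proto_header *h)
{
    if (!h || h->magic != ZXFL_MAGIC)
        return false;
    return h->version == ZXFL_VERSION_4;
}
```

A kernel that fails this check has no trustworthy protocol data and should stop in a disabled wait without dereferencing any other field.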
2. Header Fields
| Field | Type | Value / Description |
|---|---|---|
| `magic` | u32 | 0x5A58464C ("ZXFL") |
| `version` | u32 | 0x00000004 |
| `flags` | u32 | Bitmask of ZXFL_FLAG_* (see §13) |
| `binding_token` | u64 | ZXVL_SEED ^ stfle_fac[0] ^ ipl_schid |
3. Loader Identity
| Field | Type | Description |
|---|---|---|
| `loader_major` | u16 | Major version (1) |
| `loader_minor` | u16 | Minor version (0) |
| `loader_timestamp` | u32 | Build time encoded as HHMMSSZx |
4. IPL Device
| Field | Type | Description |
|---|---|---|
| `ipl_schid` | u32 | Subchannel ID of the IPL device |
| `ipl_dev_type` | u16 | Device type from Sense ID (e.g. 0x3390) |
| `ipl_dev_model` | u16 | Device model from Sense ID |
5. Kernel Image
| Field | Type | Description |
|---|---|---|
| `kernel_phys_start` | u64 | Physical base of loaded kernel |
| `kernel_phys_end` | u64 | Physical end (exclusive), after modules |
| `kernel_entry` | u64 | ELF entry point (HHDM virtual) |
6. Memory Map
| Field | Type | Description |
|---|---|---|
| `mem_total_bytes` | u64 | Total usable + kernel RAM |
| `mem_map_addr` | u64 | HHDM virtual address of zxfl_mem_region_t[] |
| `mem_map_count` | u32 | Number of valid entries |
Each zxfl_mem_region_t entry is defined as:
| Field | Type | Description |
|---|---|---|
| `base` | u64 | Physical base address of the region |
| `length` | u64 | Length of the region in bytes |
| `type` | u32 | ZXFL_MEM_* constant |
| `numa_node` | u8 | Logical NUMA node ID this memory region belongs to |
7. Page Table Pool
| Field | Type | Description |
|---|---|---|
| `pgtbl_pool_end` | u64 | Physical end of bootloader page-table bump pool |
Pool base is the first 1 MB-aligned address after kernel_phys_end, with a minimum of 32 MB. The kernel PMM must mark [pool_base, pgtbl_pool_end) as reserved.
8. Kernel Stack
| Field | Type | Description |
|---|---|---|
| `kernel_stack_top` | u64 | HHDM virtual address of initial stack top (32 KB) |
The kernel should switch to its own stack as early as possible and treat this region as reserved.
9. Control Register Snapshots
| Field | Type | Description |
|---|---|---|
| `cr0_snapshot` | u64 | CR0 at time of kernel jump |
| `cr1_snapshot` | u64 | CR1 (ASCE) at time of jump |
| `cr13_snapshot` | u64 | CR13 at time of jump |
10. SMP / CPU Map
| Field | Type | Description |
|---|---|---|
| `cpu_map[]` | zxfl_cpu_info_t[128] | Up to 128 CPU entries |
| `cpu_count` | u32 | Valid entries in cpu_map |
| `bsp_cpu_addr` | u16 | CPU address of the boot processor |
Each zxfl_cpu_info_t:
| Field | Type | Description |
|---|---|---|
| `cpu_addr` | u16 | CPU address (0–65535) |
| `type` | u8 | ZXFL_CPU_TYPE_* constant |
| `state` | u8 | ZXFL_CPU_ONLINE or ZXFL_CPU_STOPPED |
| `numa_node` | u8 | Logical NUMA node ID derived from physical book/socket |
| `drawer_id` | u8 | Drawer physical identifier from STSI 15.1.x |
| `book_id` | u8 | Book physical identifier from STSI 15.1.x |
| `socket_id` | u8 | Socket physical identifier from STSI 15.1.x |
| `chip_id` | u8 | Chip physical identifier from STSI 15.1.x |
| `thread_id` | u8 | Thread physical identifier from STSI 15.1.x |
Valid when ZXFL_FLAG_SMP is set.
11. System Identification
Populated from STSI when ZXFL_FLAG_SYSINFO is set:
| Field | Description |
|---|---|
| `manufacturer[16]` | ASCII, e.g. "IBM" |
| `type[4]` | Machine type, e.g. "2964" |
| `model[16]` | Model identifier |
| `sequence[16]` | Machine serial number |
| `plant[4]` | Manufacturing plant code |
| `lpar_name[8]` | LPAR name (STSI 2.2.2); empty on bare metal |
| `lpar_number` | LPAR number |
| `cpus_total` | Total CPUs in CEC |
| `cpus_configured` | Configured CPUs |
| `cpus_standby` | Standby CPUs |
| `capability` | CPU capability rating |
12. Modules
Up to 16 modules loaded from sysmodule= parmfile entries:
| Field | Description |
|---|---|
| `modules[i].name[32]` | Dataset name (NUL-terminated) |
| `modules[i].phys_start` | Physical load address |
| `modules[i].size_bytes` | Size in bytes |
13. Flags
| Flag | Bit | Meaning |
|---|---|---|
| `ZXFL_FLAG_SMP` | 0 | cpu_map[] is valid |
| `ZXFL_FLAG_MEM_MAP` | 1 | mem_map is valid |
| `ZXFL_FLAG_CMDLINE` | 2 | cmdline_addr is valid |
| `ZXFL_FLAG_LOWCORE` | 3 | lowcore_phys is valid |
| `ZXFL_FLAG_STFLE` | 4 | stfle_fac[] is valid |
| `ZXFL_FLAG_SYSINFO` | 5 | sysinfo is valid |
| `ZXFL_FLAG_TOD` | 6 | tod_boot is valid |
14. Binding Token
The binding token ties the boot session to the specific hardware and IPL device:
$$\texttt{binding_token} = \texttt{ZXVL_SEED} \oplus \texttt{stfle_fac[0]} \oplus \texttt{ipl_schid}$$
The kernel must recompute this value and compare it to proto->binding_token. A mismatch means the protocol was tampered with or the kernel is running on unexpected hardware.
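Recomputing the token is a pair of XORs over values the kernel can obtain itself. Because this guide does not give the value of `ZXVL_SEED`, the sketch takes the seed as a parameter; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* binding_token = seed XOR stfle_fac[0] XOR ipl_schid (zero-extended). */
static uint64_t zxvl_binding_token(uint64_t seed, uint64_t stfle_fac0,
                                   uint32_t ipl_schid)
{
    return seed ^ stfle_fac0 ^ (uint64_t)ipl_schid;
}

/* Compare the recomputed token against the one in the protocol. */
static bool zxvl_binding_token_ok(uint64_t seed, uint64_t stfle_fac0,
                                  uint32_t ipl_schid, uint64_t token)
{
    return zxvl_binding_token(seed, stfle_fac0, ipl_schid) == token;
}
```

On the kernel side, `stfle_fac0` would come from a fresh STFLE and `ipl_schid` from the protocol, so a forged protocol block must be consistent with the hardware it runs on.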
The binding token is also used as a component of the ZXVL handshake nonce and the stack frame canary. See ZXVL Verification.
ZXVL Verification
Source: arch/s390x/init/zxfl/common/zxvl_verify.c
1. Overview
ZXVL (ZXVerifiedLoad) is the integrity verification layer embedded in the ZXFL bootloader. It prevents arbitrary payloads from being loaded as the kernel nucleus. Three mechanisms are applied in sequence after ELF loading, before DAT is enabled.
2. Structural Lock
The kernel must embed a .zxfl_lock section at fixed offsets from its physical load base (load_min):
| Offset from load_min | Content |
|---|---|
0x70000 | High 32 bits of lock key: 0xCCBBCC35 |
0x70004 | Sentinel: 0x5A58464C ("ZXFL") |
0x71000 | Low 32 bits of lock key: 0xE5664311 |
The loader verifies:
$$(\texttt{key} \oplus \texttt{ZXVL_LOCK_MASK}) = \texttt{ZXVL_LOCK_EXPECTED}$$
where:
- $\texttt{key} = (\texttt{hi} \ll 32) \mid \texttt{lo}$
- $\texttt{ZXVL_LOCK_MASK} = \texttt{0x3C1E0F8704B2D596}$
- $\texttt{ZXVL_LOCK_EXPECTED} = \texttt{0xF0A5C3B2E1D49687}$
A missing sentinel or wrong key causes an immediate panic — the loader refuses to execute the image.
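The check can be modelled in a few lines of C. This is a sketch of the arithmetic only, not the loader's source: the function name `zxvl_lock_ok` and the macro `ZXVL_LOCK_SENTINEL` are hypothetical, and the three words are assumed to have been read from the offsets in the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ZXVL_LOCK_MASK      UINT64_C(0x3C1E0F8704B2D596)
#define ZXVL_LOCK_EXPECTED  UINT64_C(0xF0A5C3B2E1D49687)
#define ZXVL_LOCK_SENTINEL  UINT32_C(0x5A58464C)   /* "ZXFL"; name assumed */

/* hi:       word at load_min + 0x70000
 * sentinel: word at load_min + 0x70004
 * lo:       word at load_min + 0x71000 */
static bool zxvl_lock_ok(uint32_t hi, uint32_t sentinel, uint32_t lo)
{
    uint64_t key = ((uint64_t)hi << 32) | lo;

    return sentinel == ZXVL_LOCK_SENTINEL &&
           (key ^ ZXVL_LOCK_MASK) == ZXVL_LOCK_EXPECTED;
}
```

With the documented constants, `zxvl_lock_ok(0xCCBBCC35, 0x5A58464C, 0xE5664311)` evaluates to true, since 0xCCBBCC35E5664311 XOR the mask equals the expected value.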
3. Handshake
The kernel must place a callable function stub at load_min + 0x0 (the very first byte of the loaded image). The stub must implement:
$$f(\texttt{nonce}) = \text{rotl}_{17}(\texttt{nonce}) + \texttt{ZXVL_HS_RESPONSE}$$
where $\text{rotl}_{17}(x) = (x \ll 17) \mid (x \gg 47)$ and $\texttt{ZXVL_HS_RESPONSE} = \texttt{0xDEADBEEF0BADF00D}$.
The loader calls the stub with:
$$\texttt{nonce} = \texttt{ZXVL_SEED} \oplus \texttt{binding_token}$$
$$\texttt{binding_token} = \texttt{ZXVL_SEED} \oplus \texttt{stfle_fac[0]} \oplus \texttt{ipl_schid}$$
This ties the handshake to the specific hardware and IPL device. A kernel image that passes on one machine will not pass on another with different STFLE facilities or a different subchannel ID.
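For reference, the response function f can be written as a C model, which is handy for host-side tests of a stub. The helper names are hypothetical; the addition is taken to wrap modulo 2^64, which is an assumption consistent with 64-bit register arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define ZXVL_HS_RESPONSE UINT64_C(0xDEADBEEF0BADF00D)

/* Rotate a 64-bit value left by 17 bits. */
static inline uint64_t rotl64_17(uint64_t x)
{
    return (x << 17) | (x >> 47);
}

/* f(nonce) = rotl17(nonce) + ZXVL_HS_RESPONSE, wrapping modulo 2^64. */
static inline uint64_t zxvl_hs_expected(uint64_t nonce)
{
    return rotl64_17(nonce) + ZXVL_HS_RESPONSE;
}
```

For example, a nonce of 1 rotates to 0x20000, so the expected response is 0xDEADBEEF0BAFF00D.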
4. SHA-256 Segment Checksums
After the handshake, zxvl_verify_nucleus_checksums reads the zxvl_checksum_table_t from load_min + 0x80000 and verifies each entry:
$$\text{SHA-256}(\texttt{phys_start}, \texttt{size}) = \texttt{entry.digest}$$
Any mismatch causes an immediate panic. The table is patched into the kernel ELF by zxsign at build time. Any modification to a PT_LOAD segment after the build — including by a malicious bootloader or storage attack — is detected here.
5. Binding Token
The binding token is stored in proto->binding_token and used in two places:
- Handshake nonce (above).
- Stack frame canary:
frame[1] = ZXVL_FRAME_MAGIC_B ^ binding_token.
The canary value is unique per hardware configuration. A canary extracted from one system cannot be replayed on another.
The kernel must recompute the binding token on entry and compare it to proto->binding_token. See Boot Protocol §14.
Checksum Protocol
Document Revision: 26h1.1
1. Purpose
The checksum protocol ensures that the kernel image loaded into memory matches the image that was built and signed. It operates at three points:
| Point | Actor | Action |
|---|---|---|
| Build time | zxsign | Compute SHA-256 per PT_LOAD segment; patch into .zxvl_checksums |
| Boot time (loader) | zxvl_verify_nucleus_checksums | Recompute and compare before DAT is enabled |
| Boot time (kernel) | verify_kernel_checksums | Recompute and compare from HHDM after DAT is enabled |
The double verification (loader + kernel) ensures that neither a compromised loader nor a post-load memory modification can go undetected.
2. Table Location
The checksum table is placed in the .zxvl_checksums ELF section, which is emitted as a dedicated PT_LOAD segment with p_flags = ZXVL_PFLAGS_CKSUM (0x00200004).
The loader discovers the table's physical address by scanning the ELF program header table for a segment with that exact p_flags value. The physical address is stored in zxfl_boot_protocol_t::cksum_table_phys and passed to the kernel. No hardcoded offsets are used.
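The discovery scan might look like the following sketch. The program header struct uses the standard ELF64 field order; the function name is hypothetical, and converting `p_paddr` to a physical address by subtracting the HHDM offset is an assumption carried over from the loading guide rather than something this page states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZXVL_PFLAGS_CKSUM UINT32_C(0x00200004)
#define HHDM_OFFSET       UINT64_C(0xFFFF800000000000)
#define PT_LOAD_TYPE      UINT32_C(1)            /* ELF PT_LOAD */

/* ELF64 program header, standard field order. */
typedef struct {
    uint32_t p_type, p_flags;
    uint64_t p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align;
} elf64_phdr_t;

/* Returns the physical address of the checksum segment, or 0 if absent. */
static uint64_t find_cksum_table_phys(const elf64_phdr_t *ph, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ph[i].p_type == PT_LOAD_TYPE && ph[i].p_flags == ZXVL_PFLAGS_CKSUM)
            return ph[i].p_paddr - HHDM_OFFSET;  /* assumed conversion */
    }
    return 0;
}
```

Matching on the exact `p_flags` value, rather than on a section name, means the loader never needs the section header table.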
3. Table Format
See zxsign §3 for the full zxvl_checksum_table_t layout.
Key fields:
| Field | Value |
|---|---|
| magic | 0x5A58564C ("ZXVL") |
| version | 0x00000001 |
| algo | 0x00000001 (SHA-256) |
| count | Number of verified segments |
4. Excluded Segments
The segment containing .zxvl_checksums itself is excluded from the checksum computation. Hashing the table while building it would be circular. zxsign identifies and skips this segment automatically.
5. Kernel Re-verification
After the kernel initializes the PMM and VMM, verify_kernel_checksums re-reads the table from the HHDM virtual address and recomputes SHA-256 for each segment. This catches:
- Memory corruption between loader verification and kernel execution.
- A loader that passed verification but then modified segments before the jump.
A mismatch at this stage calls panic("sys: kernel segment checksum mismatch — image tampered").
How to Load Your Kernel with ZXFL
Document Revision: 26h1.0
For the most up-to-date information, see ZXFL Barebones.
This guide walks through every step required to produce a kernel image that ZXFL will accept and execute. Read the Boot Protocol and ZXVL Verification pages first for background.
Overview
ZXFL imposes five requirements on the kernel image before it will execute it:
- Valid ELF64 for s390x, ET_EXEC, all PT_LOAD segments in the HHDM range.
- Structural lock section at fixed offsets.
- Handshake stub at the physical load base.
- SHA-256 checksum table at load_min + 0x80000, patched by zxsign.
- Boot protocol validation on entry.
Step 1 — Link for the HHDM
All PT_LOAD segments must have virtual addresses at or above CONFIG_KERNEL_VIRT_OFFSET (0xFFFF800000000000). ZXFL computes the physical load address by subtracting this offset from p_paddr:
pa = p_paddr - 0xFFFF800000000000
No AT() override is needed. Because there is no LMA override in the linker script, p_paddr equals p_vaddr, and the loader strips the HHDM offset to get the physical address.
A minimal linker script skeleton (modelled on arch/s390x/init/link.ld):
ENTRY(my_kernel_entry)
PHDRS {
nucleus PT_LOAD FLAGS(7);
checksums_seg PT_LOAD FLAGS(4);
}
SECTIONS {
/* Handshake stub — must be the first code at the physical load base */
.zxfl_hs 0xFFFF800000100000 : {
KEEP(*(.zxfl_hs))
} :nucleus
.text 0xFFFF800000100400 : {
KEEP(*(.text.my_kernel_entry))
*(.text .text.*)
} :nucleus
.rodata : ALIGN(8) { *(.rodata .rodata.*) } :nucleus
.data : ALIGN(8) { *(.data .data.*) } :nucleus
/* Structural lock — fixed virtual offsets from load base */
.zxfl_lock 0xFFFF800000170000 : {
KEEP(*(.zxfl_lock))
} :nucleus
.bss : ALIGN(4096) {
*(.bss .bss.*) *(COMMON)
} :nucleus
/* Checksum table — fixed virtual offset from load base */
.zxvl_checksums 0xFFFF800000180000 : {
KEEP(*(.zxvl_checksums))
} :checksums_seg
}
The entry point (e_entry) must be at or above 0xFFFF800000040000 (HHDM + 256 KB). ZXFL rejects images with a lower entry point.
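The address rules in this step can be checked on the host before writing the image to DASD. The sketch below restates them in C; the helper names and the macro `ZXFL_MIN_ENTRY` are hypothetical, and the code is not taken from the loader.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CONFIG_KERNEL_VIRT_OFFSET UINT64_C(0xFFFF800000000000)
/* HHDM + 256 KB; macro name assumed for illustration. */
#define ZXFL_MIN_ENTRY (CONFIG_KERNEL_VIRT_OFFSET + UINT64_C(0x40000))

/* Every PT_LOAD segment must live in the HHDM range. */
static inline bool segment_in_hhdm(uint64_t p_vaddr)
{
    return p_vaddr >= CONFIG_KERNEL_VIRT_OFFSET;
}

/* pa = p_paddr - CONFIG_KERNEL_VIRT_OFFSET */
static inline uint64_t segment_phys(uint64_t p_paddr)
{
    return p_paddr - CONFIG_KERNEL_VIRT_OFFSET;
}

/* The loader rejects entry points below HHDM + 256 KB. */
static inline bool entry_ok(uint64_t e_entry)
{
    return e_entry >= ZXFL_MIN_ENTRY;
}
```

With the skeleton above, `.zxfl_hs` at 0xFFFF800000100000 maps to physical 0x100000, and an entry point at 0xFFFF800000100400 passes the entry check.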
Step 2 — Embed the Structural Lock
The lock constants can be placed directly in the linker script (as ZXFoundation™ does), or in a C translation unit:
/* In the linker script — simplest approach */
.zxfl_lock 0xFFFF800000170000 : {
LONG(0xCCBBCC35) /* hi */
LONG(0x5A58464C) /* sentinel "ZXFL" */
. = . + 0x1000 - 8;
LONG(0xE5664311) /* lo */
} :nucleus
The loader verifies: ((hi << 32 | lo) ^ 0x3C1E0F8704B2D596) == 0xF0A5C3B2E1D49687.
Step 3 — Implement the Handshake Stub
The stub must be the very first code at the physical load base. It receives a nonce in %r2 and must return the response in %r2. ZXVL_HS_RESPONSE = 0xDEADBEEF0BADF00D.
.machinemode zarch
.section .zxfl_hs, "ax"
.globl __zxfl_handshake_stub
.equ ZXFL_SEED_HI, 0xA5F0C3E1
.equ ZXFL_SEED_LO, 0xB2D49687
.equ HS_RESPONSE_HI, 0xDEADBEEF
.equ HS_RESPONSE_LO, 0x0BADF00D
__zxfl_handshake_stub:
llihf %r0, ZXFL_SEED_HI
iilf %r0, ZXFL_SEED_LO
xgr %r2, %r0
lgr %r0, %r2
sllg %r0, %r0, 17
srlg %r1, %r2, 47
ogr %r0, %r1
llihf %r1, HS_RESPONSE_HI
iilf %r1, HS_RESPONSE_LO
lgr %r2, %r0
agr %r2, %r1
br %r14
The stub must not clobber %r14 (return address) or %r15 (stack pointer). It must be callable with BRASL and return via BR %r14.
Step 4 — Reserve the Checksum Table
Declare the checksum table section. It is zero at link time; zxsign patches it after linking:
__attribute__((section(".zxvl_checksums"), used))
static volatile zxvl_checksum_table_t zxvl_cksum_table = { 0 };
Step 5 — Run zxsign
After linking, run the host tool on the ELF:
zxsign my_kernel.elf
This computes SHA-256 for each PT_LOAD segment (excluding .zxvl_checksums itself) and patches the table in-place. The ELF is now ready for DASD.
Step 6 — Write to DASD
Write the kernel ELF to the DASD volume as dataset CORE.ZXFOUNDATION.NUCLEUS. In sysres.conf:
DATASET CORE.ZXFOUNDATION.NUCLEUS my_kernel.elf
See Build Targets for the full dasdload invocation.
Step 7 — Handle the Boot Protocol on Entry
Your kernel entry point receives zxfl_boot_protocol_t *boot in %r2. Minimum required validation:
[[noreturn]] void my_kernel_entry(zxfl_boot_protocol_t *boot) {
    if (!boot || boot->magic != ZXFL_MAGIC)
        for (;;) __asm__("nop");
    uint64_t expected = ZXVL_COMPUTE_TOKEN(boot->stfle_fac[0], boot->ipl_schid);
    if (boot->binding_token != expected)
        for (;;) __asm__("nop");
    if (boot->version != ZXFL_VERSION_4)
        for (;;) __asm__("nop");
    /* proceed */
}
All pointer fields in the protocol are HHDM virtual addresses. Do not treat them as physical addresses.
Checklist
| # | Requirement | Enforced by |
|---|---|---|
| 1 | ELF64, ET_EXEC, e_machine = 0x16 (EM_S390) | Loader ELF validation |
| 2 | All PT_LOAD p_vaddr >= 0xFFFF800000000000 | Loader address check |
| 3 | e_entry >= 0xFFFF800000040000 | Loader entry check |
| 4 | Structural lock at load_min + 0x70000 | zxvl_verify |
| 5 | Handshake stub at load_min + 0x0 | zxvl_verify |
| 6 | Checksum table at load_min + 0x80000, patched by zxsign | zxvl_verify |
| 7 | boot->magic validated on entry | Kernel |
| 8 | boot->binding_token validated on entry | Kernel |
ZXFoundation™ Kernel Design
Document: ZXF-KRN-DESIGN-001 Revision: 26h1.0 Status: Draft Date: 2026-05-09 Author: ZXFoundation™ Core Team
Document Scope
This document is the master architectural specification for the ZXFoundation™ kernel. It defines the design of every major subsystem — capability system, memory architecture, IPC, domain model, scheduler, time, trap handling, fault recovery, and the long-term implementation roadmap.
This document does not reference source files or API signatures. Those belong in per-subsystem reference documents. This document defines what the kernel is and why it is designed that way. Pseudocode and diagrams are used where precision is required.
1. Architectural Philosophy
1.1 Design Axioms
ZXFoundation™ is a capability-based object microkernel for IBM z/Architecture. Six axioms govern every design decision:
- Minimal Trusted Computing Base. The kernel enforces only what cannot be enforced elsewhere: memory isolation, capability validity, and CPU scheduling. Everything else is a server domain.
- Capability-First. No resource may be accessed without a valid capability. There is no ambient authority. A thread that holds no capabilities can do nothing.
- No Implicit Trust. Server domains are untrusted by default, including system-provided ones. Trust is established by capability grant, not by identity or position in a hierarchy.
- z/Architecture Native. The kernel exploits z/Architecture hardware features (DAT, storage keys, SIGP, TOD clock, CPU timer, channel subsystem) directly. No portability layer is maintained.
- SysV ABI Only. The kernel defines its own system call surface. No POSIX compatibility layer exists or is planned. The SysV calling convention (GPRs 2–7 for arguments, GPR 2 for return) is the sole ABI.
- Extreme Redundancy. The kernel must not panic on a faulting server domain or a recoverable hardware error. Fault containment and recovery are first-class design requirements, not afterthoughts.
1.2 Threat Model
| Threat | Mitigation |
|---|---|
| Untrusted user domain reads kernel memory | Separate DAT address space per domain; kernel ASCE never loaded in user state |
| Untrusted domain forges a capability | Capabilities are kernel-managed integers; user space never constructs them |
| Faulting server domain corrupts kernel state | Server domains run in user state; a fault traps to the kernel, not into it |
| Hardware storage error corrupts a page | Machine-check recovery classifies and isolates the affected frame |
| Capability leak via IPC | Capability transfer is move-semantics; sender loses the capability atomically |
| Denial of service via busy loop | Scheduler enforces quanta; CPU timer interrupt is non-maskable by user state |
1.3 Kernel / User Boundary
The kernel runs exclusively in supervisor state (PSW problem-state bit = 0). All server domains and user processes run in problem state (PSW bit 15 = 1).
The boundary is enforced by z/Architecture hardware:
- DAT translates user virtual addresses through a per-domain ASCE (CR1 is loaded with the domain's ASCE on context switch).
- Storage keys restrict memory access to pages owned by the domain.
- Privileged instructions (LPSWE, SPX, SIGP, SSCH, etc.) trap to the kernel when executed in problem state.
1.4 Layered Architecture
┌─────────────────────────────────────────────────────────────────┐
│ User Processes (problem state, own ASCE, own capability table) │
├─────────────────────────────────────────────────────────────────┤
│ Server Domains (problem state, own ASCE, own capability table) │
│ [ block I/O | filesystem | network | console | device mgr ] │
├─────────────────────────────────────────────────────────────────┤
│ Kernel TCB (supervisor state, kernel ASCE) │
│ ┌──────────┬──────────┬──────────┬──────────┬───────────────┐ │
│ │ Capability│ IPC │ Scheduler│ Memory │ Trap / Syscall│ │
│ │ System │ Subsystem│ │ Manager │ Dispatch │ │
│ └──────────┴──────────┴──────────┴──────────┴───────────────┘ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ KOMS · PMM · VMM · Slab · SMP · RCU · Sync Primitives │ │
│ └──────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ z/Architecture Hardware │
│ [ DAT · Storage Keys · SIGP · TOD · CPU Timer · CSS · MCCK ] │
└─────────────────────────────────────────────────────────────────┘
2. Capability System
2.1 Definition
A capability is an unforgeable, kernel-managed token that grants a specific set of rights to a specific kernel object. Possession of a capability is both necessary and sufficient to exercise the rights it encodes. There is no access control list, no ambient authority, and no privilege escalation path outside of explicit capability grant.
2.2 Capability Token Structure
A capability token is a 64-bit opaque integer. User space treats it as an integer handle into its own capability table. The kernel interprets the internal encoding; user space never constructs or decodes it.
63 56 55 40 39 24 23 0
┌────────┬──────────┬──────────┬──────────┐
│ type │ rights │ gen │ index │
│ 8 bit │ 16 bit │ 16 bit │ 24 bit │
└────────┴──────────┴──────────┴──────────┘
| Field | Width | Meaning |
|---|---|---|
| type | 8 | Object type (maps to kobj_type_t::type_id) |
| rights | 16 | Bitmask of granted rights |
| gen | 16 | Generation counter; incremented on revocation |
| index | 24 | Index into the kernel's global object table |
The gen field enables generation-based revocation: when a capability is
revoked, the kernel increments the generation counter on the target object.
Any token whose gen field does not match the current object generation is
invalid, regardless of index or rights.
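The bit layout translates directly into shifts and masks. The following is a minimal sketch of packing and unpacking under that layout; the function names and the `cap_token_t` typedef are hypothetical, and in the real system only the kernel would ever run such code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cap_token_t;   /* hypothetical typedef */

/* type: bits 63..56, rights: 55..40, gen: 39..24, index: 23..0 */
static inline cap_token_t cap_pack(uint8_t type, uint16_t rights,
                                   uint16_t gen, uint32_t index)
{
    return ((uint64_t)type   << 56) |
           ((uint64_t)rights << 40) |
           ((uint64_t)gen    << 24) |
           (uint64_t)(index & UINT32_C(0xFFFFFF));
}

static inline uint8_t  cap_type(cap_token_t t)   { return (uint8_t)(t >> 56); }
static inline uint16_t cap_rights(cap_token_t t) { return (uint16_t)(t >> 40); }
static inline uint16_t cap_gen(cap_token_t t)    { return (uint16_t)(t >> 24); }
static inline uint32_t cap_index(cap_token_t t)  { return (uint32_t)(t & UINT64_C(0xFFFFFF)); }
```

For instance, type 1, rights 0x0003, generation 5 and index 0x123456 pack to 0x0100030005123456.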
2.3 Rights Model
Rights are type-specific. The following rights are defined at the kernel level; subsystems may define additional type-specific rights in the upper 8 bits.
| Bit | Name | Meaning |
|---|---|---|
| 0 | CAP_READ | Read the object's state |
| 1 | CAP_WRITE | Modify the object's state |
| 2 | CAP_EXEC | Execute / invoke the object |
| 3 | CAP_GRANT | Derive and transfer a capability to this object |
| 4 | CAP_REVOKE | Revoke derived capabilities |
| 5 | CAP_MAP | Map the object's memory into an address space |
| 6 | CAP_DESTROY | Destroy the object |
| 7–15 | (reserved) | Reserved or type-specific |
Derivation rule: A derived capability may only have a subset of the
parent's rights. Rights can never be amplified. A domain that holds
CAP_READ | CAP_GRANT may derive a capability with CAP_READ only.
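The derivation rule reduces to two bit tests: the parent must carry CAP_GRANT, and the requested mask must not contain any bit the parent lacks. A sketch, with the hypothetical helper name `cap_may_derive`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    CAP_READ    = 1u << 0,
    CAP_WRITE   = 1u << 1,
    CAP_EXEC    = 1u << 2,
    CAP_GRANT   = 1u << 3,
    CAP_REVOKE  = 1u << 4,
    CAP_MAP     = 1u << 5,
    CAP_DESTROY = 1u << 6,
};

/* True if `requested` is a subset of `parent` and parent may grant. */
static inline bool cap_may_derive(uint16_t parent, uint16_t requested)
{
    bool can_grant = (parent & CAP_GRANT) != 0;
    bool no_amplification = (uint16_t)(requested & ~parent) == 0;

    return can_grant && no_amplification;
}
```

A parent with CAP_READ | CAP_GRANT can derive CAP_READ but not CAP_WRITE, and a parent with CAP_READ alone cannot derive anything.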
2.4 Capability Table
Each domain owns a capability table — a flat, kernel-managed array of capability slots. The table is allocated at domain creation with a fixed capacity. User space references capabilities by their slot index (a small integer handle).
Domain Capability Table
┌───────┬──────────────────────────────────────────────┐
│ Slot │ Capability Token (64-bit, kernel-interpreted) │
├───────┼──────────────────────────────────────────────┤
│ 0 │ Self capability (CAP_READ | CAP_WRITE) │
│ 1 │ IPC endpoint capability (CAP_EXEC) │
│ 2 │ Memory region capability (CAP_READ | CAP_MAP) │
│ 3 │ (empty) │
│ ... │ ... │
│ N-1 │ (empty) │
└───────┴──────────────────────────────────────────────┘
The capability table is allocated from a dedicated slab cache backed by pages with a non-zero s390x storage key. This provides hardware-enforced isolation: a domain cannot read another domain's capability table even if it obtains a pointer to it, because the storage key check will fault.
2.5 Capability Lifecycle
cap_mint(type, rights, object)
│
▼
┌─────────────────┐
│ CAPABILITY │
│ VALID │◄──── cap_derive(parent, subset_rights)
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
cap_transfer cap_revoke object destroyed
│ │ │
▼ ▼ ▼
moved to gen++ on all tokens
receiver's object; with this
table all tokens index become
with old gen invalid
invalid
2.6 Core Operations (Pseudocode)
// Mint a new capability for an existing kernel object.
// Called only from kernel context; never directly by user space.
cap_mint(object, rights):
    slot = cap_table_alloc(current_domain.cap_table)
    token.type = object.type_id
    token.rights = rights
    token.gen = object.cap_gen
    token.index = object.global_index
    current_domain.cap_table[slot] = token
    return slot

// Derive a capability with reduced rights.
// Syscall: cap_derive(src_slot, new_rights) -> dst_slot
cap_derive(src_slot, new_rights):
    token = cap_lookup(current_domain, src_slot)
    assert token.rights & CAP_GRANT
    assert (new_rights & ~token.rights) == 0   // no amplification
    dst_slot = cap_table_alloc(current_domain.cap_table)
    new_token = token
    new_token.rights = new_rights
    current_domain.cap_table[dst_slot] = new_token
    return dst_slot

// Revoke all capabilities derived from an object.
// Increments the generation counter; all existing tokens become stale.
cap_revoke(object):
    atomic_inc(object.cap_gen)
    // No table scan needed: stale tokens fail at cap_lookup time.

// Look up and validate a capability slot.
// Returns the target object pointer, or fails.
cap_lookup(domain, slot):
    assert slot < domain.cap_table.capacity
    token = domain.cap_table[slot]
    assert token.type != CAP_TYPE_INVALID
    object = global_object_table[token.index]
    assert object != null
    assert object.cap_gen == token.gen   // generation check
    return object, token.rights
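To make the generation check concrete, here is a small single-threaded C model of mint, revoke and lookup. It is an illustration under simplifying assumptions, not the kernel's implementation: the tables are fixed-size static arrays, the generation bump is not atomic, rights are not checked against the operation, and all type and function names are local to the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OBJECTS 4
#define TABLE_SLOTS 4

typedef struct { bool live; uint16_t cap_gen; } object_t;
typedef struct { bool valid; uint16_t rights, gen; uint32_t index; } cap_slot_t;

static object_t   objects[MAX_OBJECTS];  /* stand-in for the global object table */
static cap_slot_t table[TABLE_SLOTS];    /* one domain's capability table        */

/* Mint: record the object's current generation in the slot. */
static void cap_mint(size_t slot, uint32_t index, uint16_t rights)
{
    table[slot] = (cap_slot_t){ .valid = true, .rights = rights,
                                .gen = objects[index].cap_gen, .index = index };
}

/* Revoke: bump the generation; no table scan is needed. */
static void cap_revoke(uint32_t index)
{
    objects[index].cap_gen++;
}

/* Lookup: fails on a bad slot, an empty slot, a dead object, or a stale gen. */
static bool cap_lookup(size_t slot, uint16_t *rights_out)
{
    if (slot >= TABLE_SLOTS || !table[slot].valid)
        return false;
    const object_t *obj = &objects[table[slot].index];
    if (!obj->live || obj->cap_gen != table[slot].gen)
        return false;
    *rights_out = table[slot].rights;
    return true;
}
```

After `cap_revoke`, a token minted earlier still sits in its slot, but `cap_lookup` rejects it because the stored generation no longer matches the object's.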
2.7 KOMS Integration
Every kobject_t is a capability target. The KOMS type_id field maps
directly to the capability token type field. The KOMS global object table
(indexed by token.index) is the authoritative registry of all live kernel
objects.
The capability system does not replace KOMS reference counting. A valid
capability implies the object is alive (generation check passes only while
the object is alive). When an object is destroyed, its generation is
incremented, invalidating all capabilities before the final koms_put.
┌─────────────────────────────────────────────────────┐
│ Capability System │
│ token.index ──────────────────────────────────┐ │
│ token.gen ──── generation check ────────┐ │ │
└────────────────────────────────────────────│───│───┘
│ │
┌────────────────────────────────────────────│───│───┐
│ KOMS │ │ │
│ global_object_table[index] ───────────────┘ │ │
│ kobject_t::cap_gen ───────────────────────────┘ │
│ kobject_t::ref (kref_t) — independent lifetime │
└─────────────────────────────────────────────────────┘
3. Memory Architecture
Memory is the most critical subsystem in ZXFoundation™. Every other subsystem depends on it. This section defines strict requirements and invariants for every memory layer. Violations of these requirements are kernel panics, not recoverable errors.
3.1 Physical Memory Manager (PMM)
3.1.1 Zone Model
Physical memory is partitioned into two zones at boot time. The partition is
permanent; zones are never merged or resized after pmm_init.
| Zone | Range | Purpose |
|---|---|---|
| ZONE_DMA | [0, 16 MB) | Channel I/O buffers (31-bit CDA constraint) |
| ZONE_NORMAL | [16 MB, RAM limit) | General kernel and domain allocations |
The DMA zone exists because of a hardware constraint: the Channel Data Address
(CDA) field in a CCW is 31 bits, so all I/O buffers submitted to the channel
subsystem must reside below 0x80000000 (2 GB). The 16 MB ZONE_DMA limit is
deliberately conservative and sits well inside that range.
3.1.2 Buddy Allocator
Each zone maintains a buddy allocator with orders 0 through MAX_ORDER (10),
covering block sizes from 4 KB (order 0) to 4 MB (order 10).
Zone free lists (per order):
Order 0 (4 KB): [pfn_a] → [pfn_b] → [pfn_c] → ∅
Order 1 (8 KB): [pfn_d] → ∅
Order 2 (16 KB): ∅
...
Order 10 (4 MB): [pfn_e] → ∅
Buddy invariants (non-negotiable):
- Every free block is buddy-aligned: pfn % (1 << order) == 0.
- Coalescing is mandatory on every free. If a block's buddy is also free, they are merged into a block of order+1, recursively up to MAX_ORDER.
- A block may only be freed at the same order it was allocated. Mismatched order corrupts the buddy tree and is a kernel panic.
- Free blocks are poisoned with PF_POISON. Any allocation that returns a non-poisoned block indicates a double-allocation bug.
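The alignment invariant is what makes buddy lookup cheap. The sketch below uses the conventional XOR formulation of the buddy relation, which is an assumption about how the PMM computes it; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE UINT64_C(4096)
#define MAX_ORDER 10

/* Size in bytes of one block at the given order. */
static inline uint64_t block_bytes(unsigned order)
{
    return PAGE_SIZE << order;
}

/* Invariant: pfn % (1 << order) == 0. */
static inline bool buddy_aligned(uint64_t pfn, unsigned order)
{
    return (pfn % (UINT64_C(1) << order)) == 0;
}

/* The buddy of an aligned block differs only in bit `order` of the PFN. */
static inline uint64_t buddy_of(uint64_t pfn, unsigned order)
{
    return pfn ^ (UINT64_C(1) << order);
}

/* First PFN of the order+1 block formed by merging a block with its buddy. */
static inline uint64_t merged_pfn(uint64_t pfn, unsigned order)
{
    return pfn & ~(UINT64_C(1) << order);
}
```

At order 3, the blocks starting at PFN 0 and PFN 8 are buddies; merging them yields an order-4 block at PFN 0.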
3.1.3 Per-CPU Page Cache
Order-0 (4 KB) allocations are served from a per-CPU cache to avoid zone lock contention on the hot path.
Per-CPU cache (one per zone per CPU):
count = 7
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│pfn_0│pfn_1│pfn_2│pfn_3│pfn_4│pfn_5│pfn_6│ - │ - │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
← count PCP_HIGH=16 →
Refill: when count == 0, acquire zone lock, pop PCP_BATCH=8 pages.
Drain: when count > PCP_HIGH, acquire zone lock, push PCP_BATCH pages.
The per-CPU cache is accessed with IRQs disabled. No spinlock is needed
because the cache is strictly per-CPU and IRQ handlers that allocate memory
must use ZX_GFP_ATOMIC, which bypasses the per-CPU cache and draws
directly from the zone's atomic reserve.
3.1.4 Atomic Reserve
Each zone holds PMM_ATOMIC_RESERVE = 64 pages back from the buddy
allocator. These pages are only accessible to callers that pass
ZX_GFP_ATOMIC. This ensures that hard-IRQ context allocations (e.g.,
channel I/O completion handlers) always succeed even under memory pressure.
Strict requirement: ZX_GFP_ATOMIC must only be used from hard-IRQ
context. Using it from process context to bypass memory pressure is
prohibited and will be detected by a context check in debug builds.
3.1.5 PMM Allocation Flow
pmm_alloc_page(gfp):
    if gfp & ZX_GFP_ATOMIC:
        goto zone_alloc   // bypass per-CPU cache
    if order == 0:
        page = pcp_pop(current_cpu, zone)
        if page: return page
        pcp_refill(current_cpu, zone)
        return pcp_pop(current_cpu, zone)
zone_alloc:
    acquire zone.lock (irqsave)
    for order in [requested_order .. MAX_ORDER]:
        pfn = free_area_pop(zone, order)
        if pfn != INVALID:
            split down to requested_order
            release zone.lock
            if gfp & ZX_GFP_ZERO: zero_page(pfn)
            return pfn_to_page(pfn)
    if gfp & ZX_GFP_ATOMIC and zone.atomic_reserve > 0:
        // draw from reserve
        ...
    release zone.lock
    return nullptr   // OOM
3.1.6 PMM Strict Requirements
| # | Requirement |
|---|---|
| PMM-1 | pmm_free_page/pages must never be called on a page not in PF_BUDDY state. Double-free is a kernel panic. |
| PMM-2 | The order passed to pmm_free_pages must match the order used at allocation. |
| PMM-3 | Allocation from hard-IRQ context requires ZX_GFP_ATOMIC. Any other flag in IRQ context is a kernel panic. |
| PMM-4 | zx_mem_map[] is allocated during pmm_init and never freed. It must not be modified after init except by the PMM itself. |
| PMM-5 | The per-CPU cache must be drained to the zone before a CPU goes offline. |
| PMM-6 | ZONE_DMA and ZONE_NORMAL boundaries are immutable after pmm_init. |
3.2 Virtual Memory Manager (VMM)
3.2.1 Address Space Layout
Virtual Address Space (64-bit z/Architecture, 5-level DAT)
0x0000_0000_0000_0000 ┌──────────────────────────────────────┐
│ User / Domain space │
│ (per-domain ASCE, problem state) │
0x0000_7FFF_FFFF_FFFF └──────────────────────────────────────┘
[ translation exception — unmapped ]
0xFFFF_8000_0000_0000 ┌──────────────────────────────────────┐
│ HHDM — Higher-Half Direct Map │
│ PA 0x0 → VA 0xFFFF_8000_0000_0000 │
│ Mapped with EDAT-1 (1 MB pages) │
0xFFFF_C000_0000_0000 ├──────────────────────────────────────┤
│ vmalloc / ioremap region │
│ Virtually contiguous, phys-discontig│
0xFFFF_E000_0000_0000 ├──────────────────────────────────────┤
│ Kernel image + BSS + static data │
0xFFFF_FFFF_FFFF_FFFF └──────────────────────────────────────┘
The HHDM offset 0xFFFF_8000_0000_0000 places the kernel in R1 entry 2047
(the topmost Region-First entry), cleanly separating kernel (R1[2047]) from
user space (R1[0..2046]) at the highest table level.
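The R1 index claim can be verified by extracting the table indices from a virtual address. The sketch below follows the z/Architecture virtual address split (11-bit region-first, region-second, region-third and segment indices, an 8-bit page index, and a 12-bit byte index); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Virtual address fields, most significant first:
 * RFX(11) | RSX(11) | RTX(11) | SX(11) | PX(8) | byte index(12) */
static inline unsigned dat_rfx(uint64_t va) { return (unsigned)(va >> 53); }
static inline unsigned dat_rsx(uint64_t va) { return (unsigned)((va >> 42) & 0x7FF); }
static inline unsigned dat_rtx(uint64_t va) { return (unsigned)((va >> 31) & 0x7FF); }
static inline unsigned dat_sx(uint64_t va)  { return (unsigned)((va >> 20) & 0x7FF); }
static inline unsigned dat_px(uint64_t va)  { return (unsigned)((va >> 12) & 0xFF); }
```

The HHDM base 0xFFFF_8000_0000_0000 has all eleven top bits set, so `dat_rfx` yields 2047, while the highest user address 0x0000_7FFF_FFFF_FFFF yields 0.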
3.2.2 vm_space_t and VMA Tree
Each address space is represented by a vm_space_t. The kernel has one
(kernel_vm_space). Each domain has its own, created at domain birth and
destroyed at domain death.
VMAs are indexed by an augmented RB-tree keyed on vm_start. Each node
carries subtree_max_end (the maximum vm_end in its subtree), enabling
O(log n) free-gap search for vmalloc and O(log n) overlap detection, because a
whole subtree can be ruled out with one comparison against subtree_max_end.
VMA Tree (augmented RB-tree):
[0xC000, 0xE000, max_end=0xF000]
/ \
[0xA000, 0xB000, max_end=0xB000] [0xE000, 0xF000, max_end=0xF000]
Each node: vm_start (key), vm_end, subtree_max_end, vm_prot, rb_node
Locking model:
- Readers call vmm_find_vma inside rcu_read_lock(). Fully lockless. The RCU-protected tree guarantees that a reader always sees a consistent snapshot, even while a writer is modifying the tree.
- Writers acquire aug_root.lock (spinlock, irqsave) before any insert, remove, or augmentation update.
A per-CPU hint cache stores the last-found VMA per CPU. On a cache hit (the faulting address falls within the cached VMA), the tree walk is skipped entirely — O(1) on the hot page-fault path.
3.2.3 VMM Strict Requirements
| # | Requirement |
|---|---|
| VMM-1 | All VMA modifications must hold aug_root.lock (spinlock, irqsave). |
| VMM-2 | All VMA reads must be inside rcu_read_lock(). |
| VMM-3 | VMAs must not overlap. vmm_insert_vma rejects overlapping ranges. |
| VMM-4 | vm_start and vm_end must be page-aligned (4 KB boundary). |
| VMM-5 | A vm_space_t must not be destroyed while any VMA remains mapped. |
| VMM-6 | The kernel ASCE (CR1) must never be loaded into a domain's address space. |
| VMM-7 | EDAT large pages (1 MB, 2 GB) must not be used for user domain mappings without an explicit CAP_MAP capability granting large-page access. |
| VMM-8 | vmm_remove_vma must unmap all backing pages and perform a TLB invalidation (IPTE/IDTE) before returning. |
3.2.4 Domain Address Space Creation
When a new domain is created, the kernel allocates a fresh vm_space_t and
a new R1 page table. The kernel HHDM mapping is not shared into domain
address spaces. Domains have no visibility into kernel virtual addresses.
Domain address space creation:
    alloc vm_space_t
    alloc R1 table (16 KB, order=2, ZONE_NORMAL)
    initialize all R1 entries as invalid (Z_I_BIT set)
    set vm_space.pgtbl_root = phys(R1)
    set vm_space.asce = encode_asce(phys(R1), DT=R1, TL=3)   // TL=3: full 2048-entry table
// Domain's ASCE is loaded into CR1 on context switch to this domain.
// Kernel ASCE remains in a separate register save area.
3.3 Slab and Object Allocator
3.3.1 Magazine-Depot Model
The slab allocator uses a magazine-depot architecture for per-CPU caching of fixed-size objects.
Per-CPU layer (no lock needed, IRQs disabled):
┌──────────────────────────────────────────┐
│ Hot magazine [obj0│obj1│obj2│...│objN] │ ← alloc/free here
│ Cold magazine [obj0│obj1│... ] │ ← swap with hot when full/empty
└──────────────────────────────────────────┘
↕ swap (acquire depot lock)
Global depot layer (spinlock):
┌──────────────────────────────────────────┐
│ Full magazines: [mag_a][mag_b][mag_c] │
│ Empty magazines: [mag_d][mag_e] │
└──────────────────────────────────────────┘
↕ slab page allocation (acquire zone lock)
PMM (buddy allocator)
Allocation: pop from hot magazine. If empty, swap hot/cold. If cold also empty, fetch a full magazine from the depot. If depot has none, allocate a new slab page from PMM and populate a magazine.
Free: push to hot magazine. If full, swap hot/cold. If cold also full, return the cold magazine to the depot as a full magazine.
3.3.2 Storage Key Isolation
Each slab cache may be created with a non-zero s390x storage key. Pages backing that cache are assigned the specified key. A domain that does not hold the matching key in its PSW access key field will receive a protection exception if it attempts to access those pages.
Capability table pages use a dedicated storage key (key 1 by convention). This provides hardware-enforced isolation: even if a domain obtains a pointer to another domain's capability table, the storage key check will fault before any data is read.
3.3.3 Slab Strict Requirements
| # | Requirement |
|---|---|
| SLAB-1 | kmem_cache_alloc must not be called from hard-IRQ context unless the cache was created with atomic support. Use kmalloc(ZX_GFP_ATOMIC) from IRQ context. |
| SLAB-2 | kmem_cache_free must only be called with a pointer returned by kmem_cache_alloc on the same cache. Cross-cache free is undefined behavior. |
| SLAB-3 | Freed objects are poisoned with a sentinel pattern. Re-use before alloc is detected in debug builds. |
| SLAB-4 | kmem_cache_destroy must only be called after all objects have been returned. Outstanding objects at destroy time is a kernel panic. |
3.4 Capability Memory
Capability tables are the most security-sensitive data structure in the kernel. They receive special treatment beyond the standard slab rules.
3.4.1 Allocation
Capability tables are allocated from a dedicated slab cache:
- Storage key: 1 (non-zero, distinct from general kernel data at key 0).
- GFP flags: ZX_GFP_NORMAL only. Capability tables are never allocated from the atomic reserve.
- Pages are marked PF_PINNED immediately after allocation. They are never reclaimed, swapped, or migrated.
3.4.2 Lifetime
A capability table is created atomically with its domain. It is destroyed atomically when the domain dies. The destruction sequence is:
domain_destroy(domain):
    // 1. Freeze the domain: no new capabilities may be minted into it.
    domain.state = DOMAIN_DYING
    // 2. Revoke all capabilities in the table.
    for slot in domain.cap_table:
        if domain.cap_table[slot].type != CAP_TYPE_INVALID:
            cap_revoke_slot(domain, slot)
    // 3. Free the table pages.
    kmem_cache_free(cap_table_cache, domain.cap_table)
    // 4. Drop the domain kobject reference.
    koms_put(domain.kobj)
Step 2 increments the generation counter on every object the domain held capabilities to. This atomically invalidates all derived capabilities that other domains may have received from this domain.
3.4.3 Capability Memory Strict Requirements
| # | Requirement |
|---|---|
| CAP-MEM-1 | Capability table pages must be PF_PINNED. They are never reclaimed. |
| CAP-MEM-2 | Capability table pages use storage key 1. General kernel data uses key 0. |
| CAP-MEM-3 | Capability table destruction must complete before the domain's vm_space_t is torn down. |
| CAP-MEM-4 | No capability token may be stored in user-accessible memory. The kernel never copies a raw token to user space. |
3.5 Memory for IPC
IPC memory is designed to minimize allocation on the critical path.
3.5.1 Synchronous IPC — Zero Allocation
Small synchronous messages (up to 8 × 64-bit registers) are passed entirely in CPU registers. The kernel performs a direct thread switch: the sender's GPRs 2–9 become the receiver's GPRs 2–9. No kernel buffer is allocated. No memory is touched beyond the two threads' kernel stacks.
3.5.2 Asynchronous Queue — Fixed-Capacity Ring Buffer
Each IPC endpoint that supports async messaging owns a fixed-capacity ring buffer, allocated from the slab at endpoint creation time. The capacity is specified at creation and never changes.
Async message queue (ring buffer):
head ──► ┌──────────────────────────────────────────┐
│ msg[0]: tag | regs[8] | caps[4] │
│ msg[1]: tag | regs[8] | caps[4] │
│ msg[2]: (empty) │
│ ... │
tail ──► │ msg[N-1]: (empty) │
└──────────────────────────────────────────┘
capacity = N (fixed at endpoint creation)
each message slot = 136 bytes (104 bytes of payload: 8 + 8×8 + 4×8, plus 32 bytes of padding)
The ring buffer is allocated with ZX_GFP_NORMAL and is never reallocated.
If the queue is full, the send operation returns ERR_QUEUE_FULL to the
sender. The sender is responsible for retry or backpressure.
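A fixed-capacity queue with this behavior can be sketched as follows. The sketch assumes caller-provided storage in place of the slab allocation, omits locking, and models only the payload fields of a message; ipc_ring_t, its functions, ERR_QUEUE_EMPTY, and the numeric error values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OK = 0, ERR_QUEUE_FULL = -1, ERR_QUEUE_EMPTY = -2 };  /* values assumed */

typedef struct {
    uint64_t tag;
    uint64_t regs[8];
    uint64_t caps[4];
} ipc_msg_t;

typedef struct {
    ipc_msg_t *slots;     /* storage provided once, at endpoint creation */
    size_t     capacity;  /* fixed for the endpoint's lifetime */
    size_t     head;      /* index of the next message to dequeue */
    size_t     count;     /* number of occupied slots */
} ipc_ring_t;

static void ipc_ring_init(ipc_ring_t *q, ipc_msg_t *storage, size_t capacity)
{
    q->slots = storage;
    q->capacity = capacity;
    q->head = 0;
    q->count = 0;
}

/* Never grows the buffer: a full queue is reported to the sender. */
static int ipc_ring_enqueue(ipc_ring_t *q, const ipc_msg_t *msg)
{
    if (q->count == q->capacity)
        return ERR_QUEUE_FULL;
    size_t tail = (q->head + q->count) % q->capacity;
    q->slots[tail] = *msg;
    q->count++;
    return OK;
}

static int ipc_ring_dequeue(ipc_ring_t *q, ipc_msg_t *out)
{
    if (q->count == 0)
        return ERR_QUEUE_EMPTY;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % q->capacity;
    q->count--;
    return OK;
}
```

Tracking head and count, rather than head and tail, lets all N slots be used without an ambiguous full-versus-empty state.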
3.5.3 Shared Memory — Zero-Copy Large Transfer
For bulk data transfer, the sender grants a CAP_MAP capability on a VMA.
The receiver maps the VMA into its own address space via vmm_insert_vma.
No kernel buffer is involved. The physical pages are shared between the two
address spaces via DAT table entries pointing to the same physical frames.
Shared memory transfer:
Sender domain Receiver domain
vm_space_t vm_space_t
┌──────────────────┐ ┌──────────────────┐
│ VMA [A, B) │ │ VMA [C, D) │
│ prot: R/W │ │ prot: R (derived)│
└────────┬─────────┘ └────────┬─────────┘
│ DAT entries │ DAT entries
└──────────────┬─────────────────┘
▼
Physical frames [P0, P1, ...]
The receiver's mapping uses the rights from the CAP_MAP capability. If the
capability grants only CAP_READ, the receiver's DAT entries are read-only.
A write attempt generates a protection exception in the receiver's domain,
not a kernel panic.
4. IPC Subsystem
4.1 Design Goals
IPC is the primary communication mechanism between all domains. Because ZXFoundation™ is a microkernel, IPC performance directly determines system throughput. The design targets:
- Synchronous fastpath latency: < 1 µs on z/Architecture (single hop, no contention, small message).
- Async queue throughput: limited only by memory bandwidth and ring buffer capacity.
- Zero kernel allocation on the synchronous fastpath.
- Capability transfer atomicity: a capability moved in a message is never visible in both sender and receiver simultaneously.
4.2 IPC Endpoint
An IPC endpoint is a kernel object (kobject_t, type KOBJ_TYPE_ENDPOINT).
It is the rendezvous point for IPC. A domain that wishes to receive messages
creates an endpoint and publishes a capability to it.
Endpoint state:
ENDPOINT_IDLE — no sender or receiver waiting
ENDPOINT_RECV_WAIT — a receiver thread is blocked, waiting for a message
ENDPOINT_SEND_WAIT — one or more sender threads are queued (async overflow)
An endpoint is addressed exclusively by capability. A domain that does not hold a capability to an endpoint cannot send to or receive from it.
4.3 Synchronous Fastpath
The synchronous fastpath is the primary IPC mechanism. It is used when the receiver is already blocked on the endpoint.
Synchronous IPC fastpath:
Sender Kernel Receiver
│ │ │
│ ipc_call(ep_cap, │ │
│ regs[0..7]) │ │
├─────────────────────────►│ │
│ │ cap_lookup(ep_cap) │
│ │ endpoint.state == │
│ │ RECV_WAIT? YES │
│ │ │
│ │ copy regs[0..7] to │
│ │ receiver kernel stack │
│ │ │
│ │ transfer caps (if any) │
│ │ from sender table to │
│ │ receiver table │
│ │ │
│ │ direct thread switch: │
│ [blocked] │ sender → BLOCKED │
│ │ receiver → RUNNING │
│ ├─────────────────────────►│
│ │ │ regs[0..7]
│ │ │ available
│ │ │
│ │ receiver calls │
│ │ ipc_reply(regs[0..7]) │
│ │◄─────────────────────────┤
│ │ direct thread switch: │
│ │ receiver → BLOCKED │
│◄─────────────────────────┤ sender → RUNNING │
│ regs[0..7] = reply │ │
The direct thread switch bypasses the scheduler run queue entirely. The kernel saves the sender's context, restores the receiver's context, and returns to user space in the receiver. This is the seL4-style fastpath.
Fastpath conditions (all must hold; any failure falls back to slow path):
- Endpoint state is RECV_WAIT.
- Message fits in 8 registers (no large payload).
- At most 4 capability handles transferred.
- Receiver thread is on the same CPU (avoids cross-CPU IPI on fastpath).
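The four conditions amount to a single predicate evaluated before the direct switch. The sketch below is illustrative only: endpoint_view_t, ipc_fastpath_ok, and the two limit macros are hypothetical names, and the real check would read these values from the endpoint and thread objects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    ENDPOINT_IDLE,
    ENDPOINT_RECV_WAIT,
    ENDPOINT_SEND_WAIT
} endpoint_state_t;

#define IPC_FASTPATH_MAX_REGS 8u
#define IPC_FASTPATH_MAX_CAPS 4u

/* Hypothetical snapshot of the fields the fastpath check needs. */
typedef struct {
    endpoint_state_t state;
    uint32_t         receiver_cpu;  /* CPU the blocked receiver last ran on */
} endpoint_view_t;

/* True only if every fastpath condition holds; otherwise use the slow path. */
static bool ipc_fastpath_ok(const endpoint_view_t *ep, uint32_t nr_regs,
                            uint32_t nr_caps, uint32_t sender_cpu)
{
    return ep->state == ENDPOINT_RECV_WAIT
        && nr_regs <= IPC_FASTPATH_MAX_REGS
        && nr_caps <= IPC_FASTPATH_MAX_CAPS
        && ep->receiver_cpu == sender_cpu;
}
```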
4.4 Asynchronous Queue Fallback
When the fastpath conditions are not met, the message is enqueued in the endpoint's ring buffer and the sender continues without blocking.
Async send path:
ipc_send_async(ep_cap, msg):
endpoint = cap_lookup(ep_cap, CAP_EXEC)
acquire endpoint.lock (spinlock, irqsave)
if ring_buffer_full(endpoint.queue):
release endpoint.lock
return ERR_QUEUE_FULL
ring_buffer_enqueue(endpoint.queue, msg)
if endpoint.state == RECV_WAIT:
// Wake the receiver.
thread_wake(endpoint.waiting_receiver)
endpoint.state = ENDPOINT_IDLE
release endpoint.lock
return OK
ipc_recv(ep_cap):
endpoint = cap_lookup(ep_cap, CAP_EXEC)
acquire endpoint.lock
if ring_buffer_empty(endpoint.queue):
endpoint.state = RECV_WAIT
endpoint.waiting_receiver = current_thread
release endpoint.lock
thread_block() // deschedule; woken by sender
// On wake: message is in thread's IPC buffer
return OK
msg = ring_buffer_dequeue(endpoint.queue)
release endpoint.lock
return msg
4.5 Message Structure
Every IPC message has the same fixed structure regardless of path:
IPC Message (136 bytes):
┌──────────────────────────────────────────────────────────┐
│ tag [63:0] — message type / protocol identifier │
├──────────────────────────────────────────────────────────┤
│ regs[0] [63:0] ─┐ │
│ regs[1] [63:0] │ │
│ ... │ 8 × 64-bit data words │
│ regs[7] [63:0] ─┘ │
├──────────────────────────────────────────────────────────┤
│ caps[0] [63:0] ─┐ │
│ caps[1] [63:0] │ 4 × capability handles │
│ caps[2] [63:0] │ (slot indices in sender's table) │
│ caps[3] [63:0] ─┘ │
└──────────────────────────────────────────────────────────┘
Total: (1 + 8 + 4) × 8 = 104 bytes of payload
+ 32 bytes padding = 136 bytes per slot
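The payload layout can be pinned down with compile-time checks. In this sketch the struct name ipc_msg_payload_t and the constants are hypothetical; it verifies only the 104-byte payload and derives how many bytes a 136-byte slot leaves beyond it, without assuming what those bytes hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t tag;       /* message type / protocol identifier */
    uint64_t regs[8];   /* 8 x 64-bit data words */
    uint64_t caps[4];   /* slot indices in the sender's capability table */
} ipc_msg_payload_t;

/* Layout checks: 13 doublewords, no implicit padding between fields. */
static_assert(sizeof(ipc_msg_payload_t) == 104, "payload must be 13 x 8 bytes");
static_assert(offsetof(ipc_msg_payload_t, regs) == 8, "regs follow the tag");
static_assert(offsetof(ipc_msg_payload_t, caps) == 72, "caps follow the regs");

enum {
    IPC_SLOT_SIZE  = 136,  /* slot size stated in the text */
    IPC_SLOT_SPARE = IPC_SLOT_SIZE - (int)sizeof(ipc_msg_payload_t)
};
```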
4.6 Capability Transfer
Capabilities included in a message (caps[0..3]) are transferred with
move semantics: the kernel atomically removes the capability from the
sender's table and inserts it into the receiver's table. The sender's slot
is cleared. The capability is never simultaneously visible in both tables.
cap_transfer(sender, receiver, sender_slot):
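// Acquire both table locks in ascending address order (lower address first),
// regardless of which table belongs to the sender, to avoid deadlock.
acquire sender.cap_table.lock and receiver.cap_table.lock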
token = sender.cap_table[sender_slot]
assert token.type != CAP_TYPE_INVALID
dst_slot = cap_table_alloc(receiver.cap_table)
receiver.cap_table[dst_slot] = token
sender.cap_table[sender_slot] = CAP_INVALID
release receiver.cap_table.lock
release sender.cap_table.lock
return dst_slot
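A compact C sketch of the same move, under simplifying assumptions: tables are small fixed arrays, an empty slot is encoded as 0, the locks are C11 atomic_flag spinlocks, and failure is reported with a hypothetical CAP_TRANSFER_FAILED value. It also handles the case where both arguments name the same table.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define CAP_SLOTS           8
#define CAP_INVALID         0u    /* assumed encoding of an empty slot */
#define CAP_TRANSFER_FAILED (-1)  /* hypothetical error value */

typedef struct {
    atomic_flag lock;
    uint64_t    slot[CAP_SLOTS];
} cap_table_t;

static void cap_table_init(cap_table_t *t)
{
    atomic_flag_clear(&t->lock);
    for (int i = 0; i < CAP_SLOTS; i++)
        t->slot[i] = CAP_INVALID;
}

static void table_lock(cap_table_t *t)
{
    while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire)) { }
}

static void table_unlock(cap_table_t *t)
{
    atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Move one capability; returns the destination slot or CAP_TRANSFER_FAILED. */
static int cap_transfer(cap_table_t *src, cap_table_t *dst, int src_slot)
{
    /* Lock the lower-addressed table first so concurrent transfers in the
     * opposite direction cannot deadlock. */
    cap_table_t *first  = ((uintptr_t)src < (uintptr_t)dst) ? src : dst;
    cap_table_t *second = (first == src) ? dst : src;
    table_lock(first);
    if (second != first)
        table_lock(second);

    int dst_slot = CAP_TRANSFER_FAILED;
    if (src->slot[src_slot] != CAP_INVALID) {
        for (int i = 0; i < CAP_SLOTS; i++) {
            if (dst->slot[i] == CAP_INVALID) { dst_slot = i; break; }
        }
        if (dst_slot >= 0) {
            dst->slot[dst_slot] = src->slot[src_slot];
            src->slot[src_slot] = CAP_INVALID;  /* sender loses the capability */
        }
    }

    if (second != first)
        table_unlock(second);
    table_unlock(first);
    return dst_slot;
}
```

Both writes happen while both locks are held, so no observer that takes the table locks can see the capability in two tables at once.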
4.7 IPC and KOMS
IPC endpoints are kobject_t instances registered in the KOMS namespace
under the owning domain's subtree. A domain may publish an endpoint by
name, allowing other domains to discover it via koms_ns_find_get and
then request a capability from a trusted broker.
KOMS namespace (IPC endpoints):
koms_root_ns
└── "domains"
├── "block-io"
│ └── "ep.request" ← IPC endpoint kobject
├── "filesystem"
│ └── "ep.request"
└── "console"
└── "ep.write"
5. Process and Domain Model
5.1 Fundamental Units
ZXFoundation™ defines two fundamental execution units:
- Domain: the unit of isolation. Owns an address space (vm_space_t), a capability table, and one or more threads. Analogous to a process in a monolithic kernel, but the kernel makes no distinction between a "driver domain" and an "application domain."
- Thread: the unit of scheduling. Belongs to exactly one domain. Has a kernel stack, a saved register set (irq_frame_t), and a scheduling state. Threads within the same domain share the domain's address space and capability table.
5.2 Domain Lifecycle
domain_create()
│
▼
┌───────────────┐
│ CREATING │ — address space allocated,
└───────┬───────┘ capability table allocated,
│ initial thread created
▼
┌───────────────┐
│ RUNNING │◄──── threads scheduled normally
└───────┬───────┘
│
┌───────────┼───────────┐
│ │ │
domain_kill unhandled watchdog
│ fault timeout
│ │ │
▼ ▼ │
┌──────────┐ ┌──────────┐ │
│ DYING │ │ FAULTED │◄───┘
└────┬─────┘ └────┬─────┘
│ │
│ supervisor domain
│ decides: restart or kill
│ │
│ ┌──────┴──────┐
│ │ │
│ restart kill
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │RESTARTING│ │
│ └────┬─────┘ │
│ │ │
│ ▼ ▼
│ ┌────────┐ ┌──────┐
└─►│ DEAD │ │ DEAD │
└────────┘ └──────┘
5.3 Domain Structure
A domain is a kobject_t of type KOBJ_TYPE_DOMAIN. It embeds:
Domain object:
kobject_t kobj — KOMS base (lifecycle, namespace, events)
vm_space_t space — address space (ASCE, VMA tree)
cap_table_t cap_table — capability table
list_head_t threads — list of owned threads
spinlock_t lock — protects state transitions
domain_state_t state — CREATING/RUNNING/DYING/FAULTED/RESTARTING/DEAD
uint32_t domain_id — globally unique identifier
kobject_t *supervisor — domain that receives fault events (may be null)
uint64_t heartbeat_seq — watchdog sequence number
5.4 Thread Structure
A thread is a kobject_t of type KOBJ_TYPE_THREAD. It embeds:
Thread object:
kobject_t kobj — KOMS base
domain_t *domain — owning domain (non-null, immutable)
irq_frame_t saved_regs — GPRs, FPRs, PSW (saved on context switch)
uint64_t kernel_stack — kernel stack top (virtual address)
thread_state_t state — RUNNABLE/RUNNING/BLOCKED/DEAD
sched_entity_t sched — scheduler run queue linkage
uint32_t priority — scheduling priority class
uint64_t cpu_mask — CPU affinity bitmask
uint64_t user_timer — accumulated user-mode CPU time (ns)
uint64_t sys_timer — accumulated kernel-mode CPU time (ns)
5.5 Fault Containment
When a domain faults (unhandled program check, protection exception, or watchdog timeout), the kernel:
- Suspends all threads in the domain (sets state to BLOCKED).
- Sets domain state to FAULTED.
- Fires KOBJ_EVENT_DOMAIN_FAULT on the domain's kobject.
- If the domain has a registered supervisor, delivers an IPC message to the supervisor's fault endpoint containing the fault code and domain ID.
- The supervisor decides: call domain_restart or domain_kill.
If no supervisor is registered, the kernel kills the domain immediately. The kernel itself never panics due to a domain fault.
Fault containment flow:
Domain D faults
│
▼
kernel suspends D's threads
D.state = FAULTED
koms_event_fire(D, KOBJ_EVENT_DOMAIN_FAULT)
│
├── supervisor registered?
│ YES NO
│ │ │
▼ ▼ ▼
IPC message to supervisor domain_kill(D)
{ fault_code, domain_id }
│
├── supervisor calls domain_restart(D)
│ │
│ ▼
│ D.state = RESTARTING
│ reset address space
│ reset capability table
│ restart initial thread
│ D.state = RUNNING
│
└── supervisor calls domain_kill(D)
│
▼
D.state = DEAD
destroy address space
destroy capability table
koms_put(D)
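One way to keep these state changes honest is a transition table consulted on every update. The table below is one reading of the lifecycle diagram and the containment flow, not a confirmed specification; the DOMAIN_* names other than DOMAIN_DYING, and the function domain_transition_allowed, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DOMAIN_CREATING,
    DOMAIN_RUNNING,
    DOMAIN_DYING,
    DOMAIN_FAULTED,
    DOMAIN_RESTARTING,
    DOMAIN_DEAD,
    DOMAIN_STATE_COUNT
} domain_state_t;

/* Rows are the current state, columns the requested next state.
 * Entries not listed are implicitly false. */
static const bool transition_ok[DOMAIN_STATE_COUNT][DOMAIN_STATE_COUNT] = {
    [DOMAIN_CREATING][DOMAIN_RUNNING]   = true,  /* setup complete */
    [DOMAIN_RUNNING][DOMAIN_DYING]      = true,  /* domain_kill */
    [DOMAIN_RUNNING][DOMAIN_FAULTED]    = true,  /* fault or watchdog timeout */
    [DOMAIN_FAULTED][DOMAIN_RESTARTING] = true,  /* supervisor: restart */
    [DOMAIN_FAULTED][DOMAIN_DEAD]       = true,  /* supervisor: kill */
    [DOMAIN_RESTARTING][DOMAIN_RUNNING] = true,  /* restart finished */
    [DOMAIN_DYING][DOMAIN_DEAD]         = true,  /* teardown finished */
};

static bool domain_transition_allowed(domain_state_t from, domain_state_t to)
{
    if ((unsigned)from >= DOMAIN_STATE_COUNT || (unsigned)to >= DOMAIN_STATE_COUNT)
        return false;
    return transition_ok[from][to];
}
```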
5.6 Server Domains
A server domain is a domain that provides a service to other domains. It is distinguished from a user domain only by convention and registration:
- It registers one or more IPC endpoints in the KOMS namespace under a well-known path (e.g., "domains/block-io/ep.request").
- It registers a supervisor domain (typically the system manager domain) that will restart it on fault.
- It registers a heartbeat capability with the kernel watchdog.
The kernel has no built-in concept of "driver" or "system service." All server domains are equal in privilege. Their authority derives entirely from the capabilities they hold.
5.7 KOMS Domain Hierarchy
koms_root_ns
└── "domains"
├── "system-manager" ← supervisor for all server domains
│ ├── "ep.fault" ← receives fault events
│ └── threads/
│ └── "main"
├── "block-io"
│ ├── "ep.request"
│ └── threads/
│ └── "worker-0"
├── "filesystem"
│ ├── "ep.request"
│ └── threads/
│ └── "worker-0"
└── "user-shell"
└── threads/
└── "main"
6. Scheduler
6.1 Design Goals
ZXFoundation™ targets throughput/batch workloads: long-running server domains, high CPU utilization, and minimal context-switch overhead. The scheduler is not designed for sub-millisecond interactive latency. It is designed to keep all CPUs busy and to minimize the overhead of scheduling decisions on the hot path.
6.2 Priority Classes
The scheduler defines three priority classes, processed in strict order:
| Class | Value | Quantum | Use case |
|---|---|---|---|
| SCHED_REALTIME | 0 (highest) | 1 ms | Watchdog thread, IPC notification threads |
| SCHED_BATCH | 1 | 10 ms | Server domains, user processes |
| SCHED_IDLE | 2 (lowest) | unbounded | Idle loop (runs only when no other work) |
A SCHED_REALTIME thread always preempts a SCHED_BATCH or SCHED_IDLE
thread. A SCHED_BATCH thread always preempts SCHED_IDLE. Within a
class, scheduling is round-robin.
The 10 ms batch quantum is chosen to amortize context-switch overhead over a meaningful amount of work. Server domains that perform I/O will voluntarily yield (block on IPC receive) long before the quantum expires.
6.3 Per-CPU Run Queues
Each CPU maintains three run queues, one per priority class. Run queues
are doubly-linked lists of sched_entity_t nodes embedded in thread objects.
Per-CPU scheduler state (one per CPU):
┌─────────────────────────────────────────────────────────┐
│ CPU N │
│ │
│ current_thread ──► [thread currently running] │
│ │
│ rq[SCHED_REALTIME]: [t_a] ↔ [t_b] ↔ ∅ │
│ rq[SCHED_BATCH]: [t_c] ↔ [t_d] ↔ [t_e] ↔ ∅ │
│ rq[SCHED_IDLE]: [idle_thread] ↔ ∅ │
│ │
│ rq_lock (spinlock, irqsave) │
│ nr_running (total threads across all queues) │
└─────────────────────────────────────────────────────────┘
The rq_lock is a per-CPU spinlock. It is held only during run queue
manipulation (enqueue, dequeue, pick_next). It is never held across a
context switch.
6.4 Scheduling Decision
The scheduler is invoked from three points:
- CPU timer interrupt (quantum expiry).
- thread_block(): a thread voluntarily deschedules (e.g., IPC receive).
- thread_wake(): a thread is made runnable (e.g., IPC send wakes receiver).
schedule():
acquire rq_lock (irqsave)
next = pick_next_thread(current_cpu)
if next == current_thread:
release rq_lock
return // no switch needed
prev = current_thread
current_thread = next
release rq_lock
context_switch(prev, next) // saves prev, restores next, returns in next
pick_next_thread(cpu):
for class in [SCHED_REALTIME, SCHED_BATCH, SCHED_IDLE]:
if rq[class] not empty:
thread = rq[class].head
list_rotate(rq[class]) // round-robin: move head to tail
return thread
return idle_thread // always non-null
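The strict class order and round-robin rotation can be exercised in a small model. This sketch substitutes fixed-size circular arrays of thread IDs for the doubly-linked lists of sched_entity_t nodes, ignores locking and affinity, and returns -1 where the pseudocode returns the idle thread; runq_t, cpu_sched_t, rq_enqueue, and RQ_MAX are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { SCHED_REALTIME = 0, SCHED_BATCH = 1, SCHED_IDLE = 2, SCHED_NR_CLASSES = 3 };

#define RQ_MAX 8

typedef struct {
    int    tid[RQ_MAX];  /* circular buffer of runnable thread IDs */
    size_t head;         /* index of the queue head */
    size_t len;          /* number of queued threads */
} runq_t;

typedef struct {
    runq_t rq[SCHED_NR_CLASSES];
} cpu_sched_t;

static void rq_enqueue(runq_t *q, int tid)
{
    assert(q->len < RQ_MAX);
    q->tid[(q->head + q->len) % RQ_MAX] = tid;
    q->len++;
}

/* Highest non-empty class wins; within it, the head runs and moves to the tail. */
static int pick_next_thread(cpu_sched_t *cpu)
{
    for (int c = SCHED_REALTIME; c < SCHED_NR_CLASSES; c++) {
        runq_t *q = &cpu->rq[c];
        if (q->len == 0)
            continue;
        int tid = q->tid[q->head];
        q->head = (q->head + 1) % RQ_MAX;                /* drop the head */
        q->tid[(q->head + q->len - 1) % RQ_MAX] = tid;   /* re-append at tail */
        return tid;
    }
    return -1;  /* no runnable thread in this model */
}
```

For example, with two batch threads queued the picker alternates between them until a realtime thread is enqueued, after which only the realtime thread is chosen.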
6.5 Context Switch
A context switch saves the outgoing thread's full CPU state and restores the incoming thread's state. On z/Architecture this includes:
- 16 × 64-bit general-purpose registers (GPRs 0–15)
- 16 × 64-bit floating-point registers (FPRs 0–15)
- Program Status Word (PSW: mask + instruction address)
- 16 × 32-bit access registers (ARs 0–15)
- CPU timer value (STPT / SPT)
The kernel stack pointer (GPR 15) is saved in the thread's saved_regs
and restored on the next switch. The domain's ASCE is loaded into CR1
when switching between domains.
Context switch sequence:
context_switch(prev, next):
// Save prev state to prev.saved_regs
STMG R0,R15, prev.saved_regs.gprs
STD F0..F15, prev.saved_regs.fprs // one STD per floating-point register
STFPC prev.saved_regs.fpc // floating-point control register
STAM A0,A15, prev.saved_regs.ars // access registers
STPT prev.saved_regs.cpu_timer // store CPU timer
// Update time accounting
prev.sys_timer += (STCK() - lowcore.sys_enter_timer)
// Switch address space if domains differ
if prev.domain != next.domain:
LCTLG CR1, next.domain.space.asce
// TLB is tagged by ASCE; no explicit flush needed on z/Arch
// Restore next state
SPT next.saved_regs.cpu_timer // set CPU timer
LAM A0,A15, next.saved_regs.ars
LFPC next.saved_regs.fpc
LD F0..F15, next.saved_regs.fprs
LMG R0,R15, next.saved_regs.gprs
// Record the new current thread (for fault handler identification)
lowcore.current_task = next
lowcore.sys_enter_timer = STCK()
// Return in next thread's context
6.6 Work Stealing
When a CPU's run queues are empty (only the idle thread is runnable), the CPU attempts to steal work from the busiest CPU.
Work stealing:
idle_loop(cpu):
while true:
victim = find_busiest_cpu() // scan per-CPU nr_running
if victim == null or victim.nr_running <= 1:
arch_cpu_relax() // DIAG 0x44 (z/Arch yield hint)
continue
acquire victim.rq_lock (irqsave)
acquire cpu.rq_lock (irqsave) // always in cpu_id order
steal_half(victim, cpu)
release cpu.rq_lock
release victim.rq_lock
break
Stealing moves half the victim's SCHED_BATCH threads to the idle CPU.
SCHED_REALTIME threads are never stolen — they are pinned to their
assigned CPU by the IPI mechanism.
6.7 CPU Affinity
A thread may be pinned to a subset of CPUs via its cpu_mask field. The
scheduler respects affinity: pick_next_thread skips threads whose
cpu_mask does not include the current CPU. Work stealing also respects
affinity: a thread is only stolen if the stealing CPU is in the thread's
cpu_mask.
Affinity is set at thread creation via a capability-gated syscall. The
capability must grant CAP_WRITE on the thread object.
7. Time Subsystem
7.1 Hardware Time Sources
z/Architecture provides three hardware time mechanisms: one global clock and two per-CPU facilities.
| Source | Instruction | Type | Resolution | Use |
|---|---|---|---|---|
| TOD clock | STCK / STCKF | Global, monotonic | ~0.24 ns (2^-12 µs) | Wall time, ktime_get |
| CPU timer | SPT / STPT | Per-CPU countdown | Same as TOD | Scheduler preemption |
| Clock comparator | SCKC / STCKC | Per-CPU absolute | Same as TOD | Sleep / timeout |
The TOD clock is a single hardware clock shared across all CPUs. It is
monotonic and does not wrap in any practical timeframe (64-bit, ~143 years
at full resolution). STCKF reads it without serialization — it is safe
from any context including hard-IRQ.
7.2 Kernel Time (ktime_t)
ktime_t is a 64-bit nanosecond count since kernel boot. It is derived
from the TOD clock with a boot-time offset computed during pmm_init.
TOD clock value (raw):
bits 63:0 = TOD units (1 TOD unit = 2^-12 µs ≈ 0.244 ns)
ktime conversion:
ktime_ns = (tod_raw - tod_boot_offset) * 125 / 512
≈ (tod_raw - tod_boot_offset) >> 2 (approximate; overestimates by about 2.4%)
Exact: 1 TOD unit = 1000/4096 ns = 125/512 ns
ktime_ns = tod_delta * 1000 / 4096
ktime_get() reads STCKF and applies the conversion. It is callable
from any context, holds no lock, and never sleeps.
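A direct tod_delta * 125 in 64-bit arithmetic overflows once the delta exceeds 2^64 / 125 TOD units, which is roughly 417 days of uptime. The sketch below avoids this by splitting the delta into a quotient and remainder modulo 512, which gives the same floor result; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Exact floor(tod_delta * 125 / 512) without a 128-bit intermediate.
 * With tod_delta = 512*q + r, the result is 125*q + floor(125*r / 512). */
static inline uint64_t tod_delta_to_ns(uint64_t tod_delta)
{
    uint64_t q = tod_delta >> 9;     /* tod_delta / 512 */
    uint64_t r = tod_delta & 511u;   /* tod_delta % 512 */
    return q * 125u + ((r * 125u) >> 9);
}

/* The shift-only approximation: cheap, but reads about 2.4% high. */
static inline uint64_t tod_delta_to_ns_approx(uint64_t tod_delta)
{
    return tod_delta >> 2;
}
```

For a delta of 4096 TOD units (one microsecond), the exact form yields 1000 ns while the shift yields 1024.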
7.3 CPU Timer and Scheduler Preemption
The CPU timer is a per-CPU countdown register. When its value becomes negative, a CPU timer interrupt fires (external interrupt, code 0x1005). The kernel uses this to enforce scheduler quanta.
Quantum setup (on context switch to a new thread):
quantum_tod = thread.priority == SCHED_REALTIME ? 1_ms_in_tod
: 10_ms_in_tod
SPT quantum_tod // load positive value; the timer counts down
CPU timer interrupt handler:
// Fires once the CPU timer has counted down past zero (value is negative)
sched_tick() // account time, check if quantum expired
if quantum_expired:
schedule() // pick next thread
else:
return // spurious or early; reload timer
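The placeholders 1_ms_in_tod and 10_ms_in_tod follow from the TOD format, where one microsecond is 4096 units. A small sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

enum { SCHED_REALTIME = 0, SCHED_BATCH = 1, SCHED_IDLE = 2 };

#define TOD_UNITS_PER_US 4096ull  /* 1 TOD unit = 2^-12 microseconds */

/* Convert milliseconds to TOD units. */
static inline uint64_t ms_to_tod(uint64_t ms)
{
    return ms * 1000ull * TOD_UNITS_PER_US;
}

/* Quantum magnitude for a priority class, in TOD units:
 * 1 ms for SCHED_REALTIME, 10 ms otherwise, as in the pseudocode above. */
static inline uint64_t quantum_tod(int priority_class)
{
    return priority_class == SCHED_REALTIME ? ms_to_tod(1) : ms_to_tod(10);
}
```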
7.4 Clock Comparator and Timer Wheel
The clock comparator fires an external interrupt when the TOD clock reaches
a programmed absolute value. The kernel uses this for sleep and timeout
operations.
The timer wheel is a per-CPU hierarchical structure with 8 levels and 64 slots per level. Each slot covers a time range; each level's slots are 64 times coarser than those of the level below.
Timer wheel (per CPU):
Level 0: 64 slots × 1 ms = 64 ms range (fine-grained)
Level 1: 64 slots × 64 ms = 4 s range
Level 2: 64 slots × 4 s = 256 s range
...
Level 7: 64 slots × ... = years range (coarse)
Each slot: list of timer_t objects expiring in that window
On clock comparator interrupt:
advance current slot pointer
fire all timers in the current slot
if level 0 wraps: cascade from level 1, etc.
program clock comparator for next non-empty slot
Timer callbacks execute in softirq context — after the hard-IRQ handler returns, before returning to user space. They must not block, must not acquire spinlocks held by hard-IRQ handlers, and must complete in bounded time.
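Placing a timer in the wheel reduces to choosing a level from the distance to expiry and a slot from the expiry time itself. The scheme below is a common hierarchical-wheel indexing and is an assumption here, not necessarily the kernel's; it works in milliseconds, and wheel_pos_t and wheel_position are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define WHEEL_LEVELS    8
#define WHEEL_SLOT_BITS 6u   /* 64 slots per level */

typedef struct {
    unsigned level;
    unsigned slot;
} wheel_pos_t;

/* Level k holds timers whose distance to expiry is below 64^(k+1) ms;
 * the slot is the k-th base-64 digit of the absolute expiry time. */
static wheel_pos_t wheel_position(uint64_t now_ms, uint64_t expiry_ms)
{
    uint64_t delta = expiry_ms > now_ms ? expiry_ms - now_ms : 0;
    unsigned level = 0;
    while (level < WHEEL_LEVELS - 1 &&
           (delta >> (WHEEL_SLOT_BITS * (level + 1))) != 0)
        level++;
    wheel_pos_t pos = {
        level,
        (unsigned)((expiry_ms >> (WHEEL_SLOT_BITS * level)) & 63u)
    };
    return pos;
}
```

A timer due in 5 ms lands in level 0, one due in 64 ms in level 1, and one due in 4096 ms in level 2, matching the per-level ranges listed above.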
7.5 Time Accounting
Per-thread time accounting uses the lowcore timing fields:
Kernel entry (SVC, PGM, EXT, IO):
lowcore.sys_enter_timer = STCK()
Kernel exit (return to user space):
elapsed = STCK() - lowcore.sys_enter_timer
current_thread.sys_timer += elapsed
lowcore.exit_timer = STCK()
User time (updated on kernel entry):
user_elapsed = lowcore.sys_enter_timer - lowcore.exit_timer
current_thread.user_timer += user_elapsed
7.6 Time Strict Requirements
| # | Requirement |
|---|---|
| TIME-1 | ktime_get() must be callable from any context including hard-IRQ. It reads STCKF directly — no lock, no sleep. |
| TIME-2 | Timer callbacks execute in softirq context. They must not block or acquire locks held by hard-IRQ handlers. |
| TIME-3 | The CPU timer must be reloaded on every context switch. A thread must never run beyond its quantum without a timer interrupt. |
| TIME-4 | The clock comparator must be reprogrammed after every timer wheel advance to the next non-empty slot. |
| TIME-5 | tod_boot_offset is computed once during pmm_init and never modified. |
8. Trap and System Call Architecture
8.1 Interrupt Classes
z/Architecture defines six hardware interrupt classes. Each has a dedicated new PSW slot in the lowcore and a dedicated entry point in the kernel.
| Class | Lowcore offset | Trigger | Kernel handler |
|---|---|---|---|
| RESTART | 0x01A0 | SIGP RESTART (AP bringup) | restart_handler |
| EXTERNAL | 0x01B0 | CPU timer, clock comparator, SIGP, service call | ext_handler |
| SVC | 0x01C0 | SVC n instruction (system call) | svc_handler |
| PROGRAM | 0x01D0 | Page fault, protection exception, illegal instruction | pgm_handler |
| MCCK | 0x01E0 | Machine check (hardware error) | mcck_handler |
| IO | 0x01F0 | Channel subsystem I/O completion | io_handler |
8.2 Entry Path
All interrupt classes share the same entry structure:
Hardware interrupt fires:
1. Hardware saves old PSW to lowcore (e.g., svc_old_psw at 0x0140).
2. Hardware saves interrupt parameters to lowcore
(e.g., svc_code at 0x008A for SVC).
3. Hardware loads new PSW from lowcore (e.g., svc_new_psw at 0x01C0).
4. Execution begins at the kernel entry stub.
Kernel entry stub (assembly):
STMG R0,R15, lowcore.save_area_sync // save all GPRs
// Build irq_frame_t on kernel stack:
// gprs[16], psw (from lowcore old PSW), ilc, code
LG R15, lowcore.kernel_stack // switch to kernel stack
BRASL R14, <C handler> // call C dispatcher
// On return: restore GPRs, LPSWE to return PSW
LMG R0,R15, frame.gprs
LPSWE frame.psw
The irq_frame_t on the kernel stack is the canonical representation of
the interrupted context. It is used by the fault handler, the debugger,
and the context switch path.
8.3 SVC — System Call Dispatch
ZXFoundation™ defines its own system call table. There is no POSIX
compatibility layer. The SVC number is in lowcore.svc_code (16-bit).
Arguments follow the SysV ABI: GPRs 2–7. Return value in GPR 2.
Every system call that operates on a kernel object takes a capability
handle as its first argument (GPR 2). The kernel validates the capability
before performing any operation. An invalid or insufficient capability
returns ERR_CAP_INVALID immediately.
SVC dispatch:
svc_handler(frame):
svc_nr = lowcore.svc_code & 0xFF
if svc_nr >= ZX_SYSCALL_MAX:
return ERR_INVALID_SYSCALL
cap_handle = frame.gprs[2]
object, rights = cap_lookup(current_domain, cap_handle)
if object == null:
return ERR_CAP_INVALID
return syscall_table[svc_nr](object, rights, frame)
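The bounds check and table lookup can be sketched in C. This is an illustration under assumptions: the handler signature is simplified to two integer arguments, the capability lookup is left out, the two stubs stand in for real handlers, the error value is invented, and the NULL check for reserved entries is an addition not present in the pseudocode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ZX_SYSCALL_MAX = 32 };
enum { ERR_INVALID_SYSCALL = -1 };  /* value assumed */

typedef long (*syscall_fn_t)(uint64_t arg0, uint64_t arg1);

/* Stub handlers for the sketch only. */
static long sys_time_get_stub(uint64_t a0, uint64_t a1) { (void)a0; (void)a1; return 42; }
static long sys_yield_stub(uint64_t a0, uint64_t a1)    { (void)a0; (void)a1; return 0; }

/* Unlisted entries are NULL, which covers the reserved numbers. */
static const syscall_fn_t syscall_table[ZX_SYSCALL_MAX] = {
    [18] = sys_time_get_stub,
    [20] = sys_yield_stub,
};

static long svc_dispatch(uint16_t svc_code, uint64_t arg0, uint64_t arg1)
{
    unsigned nr = svc_code & 0xFFu;  /* SVC number is the low byte */
    if (nr >= ZX_SYSCALL_MAX || syscall_table[nr] == NULL)
        return ERR_INVALID_SYSCALL;
    return syscall_table[nr](arg0, arg1);
}
```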
ZXFoundation™ v1 system call surface (~32 syscalls):
| Number | Name | Capability type | Description |
|---|---|---|---|
| 0 | zx_cap_derive | any | Derive a capability with reduced rights |
| 1 | zx_cap_transfer | any + CAP_GRANT | Transfer a capability via IPC message |
| 2 | zx_cap_revoke | any + CAP_REVOKE | Revoke all derived capabilities |
| 3 | zx_domain_create | domain factory | Create a new domain |
| 4 | zx_domain_kill | domain + CAP_DESTROY | Kill a domain |
| 5 | zx_domain_restart | domain + CAP_WRITE | Restart a faulted domain |
| 6 | zx_thread_create | domain + CAP_WRITE | Create a thread in a domain |
| 7 | zx_thread_start | thread + CAP_EXEC | Start a thread at a given address |
| 8 | zx_thread_exit | — | Terminate the calling thread |
| 9 | zx_ipc_call | endpoint + CAP_EXEC | Synchronous IPC call |
| 10 | zx_ipc_recv | endpoint + CAP_EXEC | Block waiting for a message |
| 11 | zx_ipc_reply | — | Reply to a synchronous call |
| 12 | zx_ipc_send | endpoint + CAP_EXEC | Async send (non-blocking) |
| 13 | zx_mem_map | VMA + CAP_MAP | Map a VMA into the calling domain |
| 14 | zx_mem_unmap | VMA + CAP_WRITE | Unmap a VMA |
| 15 | zx_mem_alloc | domain + CAP_WRITE | Allocate anonymous memory |
| 16 | zx_endpoint_create | domain + CAP_WRITE | Create an IPC endpoint |
| 17 | zx_endpoint_destroy | endpoint + CAP_DESTROY | Destroy an endpoint |
| 18 | zx_time_get | — | Read ktime_t (no capability needed) |
| 19 | zx_sleep | — | Sleep for a duration |
| 20 | zx_yield | — | Voluntarily yield the CPU |
| 21 | zx_watchdog_register | domain + CAP_WRITE | Register a heartbeat capability |
| 22 | zx_watchdog_heartbeat | watchdog cap | Signal liveness to the watchdog |
| 23–31 | reserved | — | Future use |
8.4 PGM — Program Check Handler
The program check handler dispatches on lowcore.pgm_code:
pgm_handler(frame):
code = lowcore.pgm_code
addr = lowcore.trans_exc_code // faulting virtual address (if applicable)
switch code:
case PGM_TRANSLATION_EXCEPTION: // page fault
vma = vmm_find_vma(current_domain.space, addr)
if vma == null:
goto domain_fault // no mapping → domain fault
page = pmm_alloc_page(ZX_GFP_NORMAL)
if page == null:
goto domain_fault // OOM → domain fault
mmu_map_page(current_domain.space, addr, page, vma.vm_prot)
return // retry the faulting instruction
case PGM_PROTECTION_EXCEPTION: // write to read-only page, or key mismatch
goto domain_fault
case PGM_PRIVILEGED_OPERATION: // user tried a privileged instruction
goto domain_fault
case PGM_SPECIFICATION_EXCEPTION: // alignment or format error
goto domain_fault
default:
goto domain_fault
domain_fault:
domain_suspend(current_domain)
deliver_fault_event(current_domain, code, addr)
schedule() // switch to another thread
A program check in kernel context (PSW problem-state bit = 0 at the time of the fault) is always a kernel panic. The kernel must not generate translation exceptions or protection exceptions in its own address space.
8.5 EXT — External Interrupt Handler
ext_handler(frame):
code = lowcore.ext_int_code
switch code:
case EXT_CPU_TIMER (0x1005):
sched_tick()
if quantum_expired: schedule()
case EXT_CLOCK_COMPARATOR (0x1004):
timer_wheel_advance(current_cpu)
program_clock_comparator(next_expiry)
case EXT_SERVICE_CALL (0x2401):
sclp_service_call_handler() // SCLP response (console, hardware info)
case EXT_SIGP_EMERGENCY (0x1201):
ipi_handler() // cross-CPU IPI (TLB shootdown, CPU offline)
default:
// Unknown external interrupt: log and ignore.
8.6 IO — Channel Subsystem Interrupt Handler
io_handler(frame):
schid.sch_no = lowcore.subchannel_nr
schid.ssid = lowcore.subchannel_id >> 16
// Read the Interrupt Response Block (IRB) via TSCH.
TSCH schid, irb
// Look up the IRQ descriptor for this subchannel.
desc = irq_lookup_by_schid(schid)
if desc == null:
return // spurious; no handler registered
// Dispatch to the registered handler.
// The handler is typically the block-I/O server domain's IPC endpoint.
desc.handler(desc, &irb)
The I/O handler is intentionally minimal. It reads the IRB and dispatches to a registered handler. The handler is responsible for notifying the appropriate server domain via IPC. The kernel does not interpret I/O completion data.
9. Machine-Check Recovery and Watchdog
9.1 Machine-Check Classification
When a machine-check interrupt fires, lowcore.mcck_interruption_code
classifies the error. The kernel classifies each error as recoverable or
unrecoverable:
| Error class | Recoverable? | Action |
|---|---|---|
| Storage error (corrected) | Yes | Log; mark page suspect; continue |
| Storage error (uncorrected) | No | Offline affected frames; migrate domains |
| CPU malfunction | No | Offline CPU; migrate its domains |
| Timing facility error | Yes | Re-sync TOD; log |
| External damage | No | Kernel panic (hardware integrity lost) |
9.2 Machine-Check Recovery Flow
mcck_handler(frame):
code = lowcore.mcck_interruption_code
if code & MCCK_SD: // system damage — unrecoverable
goto kernel_panic
if code & MCCK_ST: // storage error
addr = lowcore.failing_storage_address
page = phys_to_page(addr)
if code & MCCK_ST_CORRECTED:
pmm_mark_suspect(page) // log; keep in service
else:
pmm_offline_page(page) // remove from buddy; migrate domains
domain_migrate_from_page(page)
if code & MCCK_CPU: // CPU malfunction
cpu_offline(current_cpu) // SIGP STOP self after migration
domain_migrate_all(current_cpu)
SIGP STOP, current_cpu_addr
// Recoverable: return to interrupted context.
LPSWE frame.psw
9.3 CPU Offline and Domain Migration
When a CPU is taken offline (due to MCCK or operator request):
cpu_offline(cpu):
// 1. Stop accepting new work.
cpu.state = CPU_OFFLINE_PENDING
// 2. Drain the run queue to other CPUs.
acquire cpu.rq_lock
for each thread in cpu.rq[SCHED_BATCH]:
target = find_least_loaded_cpu(thread.cpu_mask)
enqueue(target.rq[SCHED_BATCH], thread)
release cpu.rq_lock
// 3. Notify domains whose threads were migrated.
for each migrated_thread:
koms_event_fire(migrated_thread.domain, KOBJ_EVENT_DOMAIN_MIGRATE)
// 4. Stop the CPU.
cpu.state = CPU_OFFLINE
SIGP STOP, cpu.cpu_addr
9.4 Domain Watchdog
The kernel maintains a single watchdog thread at SCHED_REALTIME priority.
Each server domain that registers with the watchdog receives a heartbeat
capability. The domain must call zx_watchdog_heartbeat within a configured
interval (default: 5 seconds).
Watchdog state machine (per registered domain):
WATCHDOG_OK ──── heartbeat received ────► WATCHDOG_OK
│
│ interval elapsed without heartbeat
▼
WATCHDOG_WARN ──── heartbeat received ──► WATCHDOG_OK
│
│ second interval elapsed
▼
WATCHDOG_FAULT
│
▼
domain_fault(domain) // triggers fault containment flow (Section 5.5)
The watchdog thread runs on a dedicated CPU (CPU 0 by convention) and is
never migrated. It is the only SCHED_REALTIME thread that the kernel
creates at boot time.
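The per-domain state machine is small enough to express as a pure transition function. In this sketch the event enum and the function name watchdog_step are hypothetical, and WATCHDOG_FAULT is treated as terminal because fault containment takes over from there.

```c
#include <assert.h>

typedef enum { WATCHDOG_OK, WATCHDOG_WARN, WATCHDOG_FAULT } watchdog_state_t;

typedef enum {
    WD_EVENT_HEARTBEAT,         /* domain called zx_watchdog_heartbeat */
    WD_EVENT_INTERVAL_ELAPSED   /* a full interval passed without one */
} watchdog_event_t;

/* Next state for one registered domain. */
static watchdog_state_t watchdog_step(watchdog_state_t state, watchdog_event_t ev)
{
    if (state == WATCHDOG_FAULT)
        return WATCHDOG_FAULT;   /* terminal: handled by fault containment */
    if (ev == WD_EVENT_HEARTBEAT)
        return WATCHDOG_OK;      /* any heartbeat clears a warning */
    return state == WATCHDOG_OK ? WATCHDOG_WARN : WATCHDOG_FAULT;
}
```

With the default 5-second interval, a domain that stops sending heartbeats reaches WATCHDOG_FAULT after two missed intervals.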
9.5 Kernel Self-Check (syschk)
The existing zx_system_check() infrastructure is extended with severity
levels:
| Severity | Action |
|---|---|
| ZX_SYSCHK_WARNING | Log to kernel ring buffer; continue |
| ZX_SYSCHK_DEGRADED | Disable the affected subsystem; log; continue |
| ZX_SYSCHK_CORE_CORRUPT | Disabled-wait PSW (kernel panic) |
ZX_SYSCHK_CORE_CORRUPT is reserved for conditions where kernel data
structures are known to be corrupted and continued execution would cause
silent data loss or security violations. All other conditions should use
WARNING or DEGRADED to maximize availability.
9.6 Storage Key Protection
Each domain is assigned a non-zero s390x storage key at creation time. All pages mapped into the domain's address space are assigned that key. The domain's PSW access key field is set to match.
A domain that attempts to access a page with a mismatched storage key receives a protection exception (PGM code 0x04). This is handled as a domain fault (Section 8.4) — the domain is suspended, not the kernel.
This provides a hardware-enforced memory isolation layer that operates independently of DAT. Even if a bug in the kernel's page table management accidentally maps a page from domain A into domain B's address space, the storage key check will prevent domain B from writing it, and also from reading it as long as the fetch-protection bit is set in the page's storage key.
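The key comparison the hardware performs can be modeled in a few lines. This is a simplified model of key-controlled protection that ignores the architecture's override facilities; storage_key_t and key_access_allowed are hypothetical names used only for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t acc;            /* 4-bit access-control field of the storage key */
    bool    fetch_protect;  /* fetch-protection bit of the storage key */
} storage_key_t;

/* Returns true if the access proceeds, false if it raises a
 * protection exception (PGM code 0x04). */
static bool key_access_allowed(uint8_t psw_key, storage_key_t sk, bool is_store)
{
    if (psw_key == 0 || psw_key == sk.acc)
        return true;   /* key 0 or a matching key: always allowed */
    if (!is_store && !sk.fetch_protect)
        return true;   /* mismatched fetch is allowed without fetch protection */
    return false;
}
```

The model shows why read isolation depends on the fetch-protection bit: with it clear, a mismatched key blocks stores but not fetches.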
10. Long-Term Implementation Roadmap
10.1 Overview
The roadmap is organized into seven phases. Each phase has a clear prerequisite, a defined deliverable, and a set of subsystems it unlocks. Phases are sequential within a dependency chain but may overlap where dependencies permit.
Phase dependency graph:
[Phase 1: TCB Hardening]
│
▼
[Phase 2: Capability Foundation]
│
▼
[Phase 3: Domain and IPC]
│
┌────┴────┐
▼ ▼
[Phase 4: [Phase 6:
Server Memory
Domain Completion]
Infra]
│
▼
[Phase 5: First Server Domains]
│
▼
[Phase 7: Hardening and Observability]
10.2 Phase 1 — TCB Hardening
Prerequisite: Current state (PMM, VMM, slab, KOMS, IRQ, SMP, sync all functional).
Deliverables:
- Trap/entry completion: Full irq_frame_t save/restore for all six interrupt classes. SVC, PGM, EXT, IO, MCCK, RESTART handlers dispatch to C. Return path restores full CPU state via LPSWE.
- Time subsystem: TOD clock read (STCKF), ktime_t type and ktime_get(). CPU timer setup and quantum enforcement. Clock comparator setup. Timer wheel (8 levels, 64 slots). ktime_sleep().
- Scheduler, BATCH class: Per-CPU run queues. schedule(), thread_block(), thread_wake(). Context switch (GPR/FPR/PSW save/restore). CPU timer interrupt → sched_tick(). Work stealing. Idle thread per CPU.
Unlocks: Phase 2 (capability system requires a running scheduler to test domain creation).
10.3 Phase 2 — Capability Foundation
Prerequisite: Phase 1 complete.
Deliverables:
- Capability token: 64-bit structure, type/rights/gen/index fields. cap_mint, cap_derive, cap_revoke, cap_lookup.
- Capability table: Slab cache with storage key 1. Per-domain flat array. cap_table_alloc, cap_table_free. PF_PINNED pages.
- KOMS extension: kobject_t gains cap_gen (generation counter) and global_index (object table index). Global object table (flat array, spinlock-protected). koms_init_obj registers in table. koms_put at zero increments cap_gen before freeing.
- Syscalls 0–2: zx_cap_derive, zx_cap_transfer, zx_cap_revoke. SVC dispatch table. Capability validation on every syscall entry.
Unlocks: Phase 3 (domain creation requires capability tables).
10.4 Phase 3 — Domain and IPC
Prerequisite: Phase 2 complete.
Deliverables:
- Domain object: domain_t kobject type. vm_space_t creation per domain. Capability table allocation at domain birth. Domain lifecycle state machine. domain_create, domain_kill.
- Thread object: thread_t kobject type. Kernel stack allocation. thread_create, thread_start, thread_exit. Integration with scheduler (enqueue on thread_start).
- SVC entry (capability validation): Every syscall validates its capability argument before proceeding. ERR_CAP_INVALID on failure.
- IPC sync fastpath: zx_ipc_call, zx_ipc_recv, zx_ipc_reply. Direct thread switch. Register-passing (GPRs 2–9). Fastpath conditions enforced.
- IPC async queue: Ring buffer slab allocation. zx_ipc_send. Enqueue/dequeue. Receiver wake on enqueue.
- Syscalls 3–17: Full domain, thread, memory, and endpoint syscalls.
Unlocks: Phase 4 and Phase 6 (both depend on working domains and IPC).
10.5 Phase 4 — Server Domain Infrastructure
Prerequisite: Phase 3 complete.
Deliverables:
- Fault containment: domain_suspend, deliver_fault_event. Fault event IPC to supervisor domain. domain_restart, domain_kill from supervisor.
- Domain watchdog: Watchdog thread at SCHED_REALTIME. Heartbeat capability. zx_watchdog_register, zx_watchdog_heartbeat. Two-strike fault trigger.
- MCCK recovery: Storage error classification. pmm_offline_page. CPU offline and domain migration. KOBJ_EVENT_DOMAIN_MIGRATE.
- Storage key assignment: Per-domain key allocation. Page key assignment on vmm_insert_vma. PSW access key set on context switch.
- System manager domain: The first server domain, started by the kernel at boot. Receives fault events for all other server domains. Implements restart policy.
Unlocks: Phase 5 (server domains require fault containment to be safe).
10.6 Phase 5 — First Server Domains
Prerequisite: Phase 4 complete.
Deliverables:
- Console server: Wraps DIAG 0x08 / SCLP. Exposes ep.write endpoint. Accepts zx_ipc_send with a string payload. Replaces printk for user-visible output.
- Channel I/O server: Wraps CSS interrupt dispatch. Accepts subchannel registration from other domains. Exposes ep.request for I/O submission. Returns I/O completion via IPC reply.
- Block I/O server: Built on channel I/O server. Implements ECKD (DASD) read/write. Exposes ep.request with a block I/O protocol.
- Filesystem server (minimal): Built on block I/O server. Implements a read-only flat filesystem (sufficient to load user programs). Exposes ep.open, ep.read.
Unlocks: Phase 7 (hardening requires a running system to test against).
10.7 Phase 6 — Memory Management Completion
Prerequisite: Phase 3 complete (can proceed in parallel with Phase 4/5).
Deliverables:
- Demand paging: PGM translation exception → vmm_find_vma → pmm_alloc_page → mmu_map_page → retry. Anonymous and file-backed VMAs.
- Copy-on-write: VM_COW flag on shared VMAs. Write protection fault → page copy → remap. Used for domain cloning (fork-like semantics).
- Page reclaim: LRU list per zone. Reclaim under memory pressure (triggered when ZONE_NORMAL.free_pages < LOW_WATERMARK). Reclaim selects cold anonymous pages; writes dirty pages to swap device.
- Swap: Capability-gated swap device via channel I/O server. Swap page table entries. pmm_swap_out, pmm_swap_in.
Unlocks: Phase 7 (full memory management required for production use).
10.8 Phase 7 — Hardening and Observability
Prerequisite: Phases 4, 5, and 6 complete.
Deliverables:
- KOMS attribute bus: Expose domain/thread/memory statistics as KOMS attributes. Readable via zx_attr_get syscall with a capability.
- Kernel ring buffer: Fixed-size circular log buffer. Capability-gated read via ep.klog endpoint. Replaces printk for kernel diagnostics.
- Capability audit log: Every cap_mint, cap_derive, cap_revoke, and cap_transfer is logged to a dedicated ring buffer. Readable by the system manager domain.
- Syscall fuzz harness: Host-side tool that generates random syscall sequences and validates that the kernel never panics (only returns error codes) on invalid inputs.
- SMP stress test: Multi-domain IPC stress test exercising the fastpath, work stealing, and domain fault/restart under load.
10.9 Milestone Summary
| Phase | Key Deliverable | Unlocks |
|---|---|---|
| 1 | Trap, time, scheduler | Capability system |
| 2 | Capability tokens and tables | Domain creation |
| 3 | Domains, threads, IPC | Server domains, memory completion |
| 4 | Fault containment, watchdog, MCCK | First server domains |
| 5 | Console, block I/O, filesystem | Full system |
| 6 | Demand paging, CoW, reclaim, swap | Production memory management |
| 7 | Observability, audit, hardening | Production readiness |
End of ZXF-KRN-DESIGN-001 Rev 26h1.0
Kernel Overview
Document Revision: 26h1.0
1. Entry Contract
The kernel receives control from ZXFL with the following guaranteed state:
| Resource | State |
|---|---|
| DAT | On — CR1 holds the ASCE built by the loader |
| Interrupts | Masked — all interrupt classes disabled |
| %r2 | HHDM virtual address of zxfl_boot_protocol_t |
| %r15 | HHDM virtual address of initial stack top (32 KB loader stack) |
| All other GPRs | Undefined |
The kernel entry point is zxfoundation_global_initialize(zxfl_boot_protocol_t *boot). The first action must be to validate boot->magic == ZXFL_MAGIC. Any other use of the protocol before this check is undefined behavior.
2. Subsystem Table
| Subsystem | Source location | Status |
|---|---|---|
| Early init | zxfoundation/init/ | Active |
| PMM | zxfoundation/memory/pmm.c | Active |
| VMM | zxfoundation/memory/vmm.c | Active |
| Slab | zxfoundation/memory/slab.c | Active |
| kmalloc | zxfoundation/memory/kmalloc.c | Active |
| Heap | zxfoundation/memory/heap.c | Active |
| MMU | arch/s390x/mmu/mmu.c | Active |
| Per-CPU | arch/s390x/cpu/percpu.c | Active |
| qspinlock | arch/s390x/cpu/qspinlock.c | Active |
| Mutex | zxfoundation/sync/mutex.c | Active |
| RW Lock | zxfoundation/sync/rwlock.c | Active |
| Semaphore | zxfoundation/sync/semaphore.c | Active |
| Wait queue | zxfoundation/sync/waitqueue.c | Active |
| RCU | zxfoundation/sync/rcu.c | Active |
| SRCU | zxfoundation/sync/srcu.c | Active |
| kobject | zxfoundation/object/kobject.c | Active |
| printk | zxfoundation/sys/printk.c | Active |
| panic | zxfoundation/sys/panic.c | Active |
| Trap | arch/s390x/trap/ | Active |
| SMP | arch/s390x/cpu/smp.c | Active |
| Scheduler | zxfoundation/sched/ | Active |
| IRQ | arch/s390x/irq/ | Stub |
| Time | arch/s390x/time/ | Stub |
Early Initialization
Document Revision: 26h1.0
Source: zxfoundation/init/main.c
1. Initialization Sequence
zxfoundation_global_initialize performs early initialization in strict order before enabling interrupts or starting APs:
| Step | Action | Notes |
|---|---|---|
| 1 | zxfl_lowcore_setup() | Install kernel new PSWs in the BSP lowcore |
| 2 | diag_setup() + printk_initialize() | Enable console output |
| 3 | Validate boot->magic == ZXFL_MAGIC | Panic if wrong |
| 4 | Validate boot->binding_token | Recompute and compare; panic on mismatch |
| 5 | validate_stack_frame() | Verify ZXVL stack canaries |
| 6 | verify_kernel_checksums() | Re-verify SHA-256 segment digests from HHDM |
| 7 | Print machine/LPAR/CPU info | If ZXFL_FLAG_SYSINFO / ZXFL_FLAG_SMP set |
| 8 | percpu_init_bsp() | Initialize BSP per-CPU block at prefix+0x200 |
| 9 | arch_cpu_features_init(boot) | Detect STFLE facilities, populate feature flags |
| 10 | rcu_init() | Initialize RCU subsystem |
| 11 | pmm_init(boot) | Register usable memory regions; reserve loader/kernel/pool |
| 12 | mmu_init() | Install 8 KB VA-0 lowcore window; scrub identity map; inherit EDAT-1/2 state. Order is mandatory — see §4. |
| 13 | vmm_init() | Set up vmalloc region |
| 14 | slab_init() | Initialize slab caches |
| 15 | kmalloc_init() | Initialize kmalloc size classes |
| 16 | trap_init() | Install program-check new PSW; enable trap handler |
| 17 | smp_init() | Start all APs (SIGP sequence); each AP calls trap_init() |
| 18 | sched_init() | BSP becomes idle (PID 0); spawns kernel_init (PID 1) |
2. Security Checks (Steps 3–6)
These checks run immediately after lowcore and console setup (steps 1–2) and before any other subsystem is initialized. A failure at any point calls panic(), which loads a disabled-wait PSW.
Binding token (step 4): The kernel recomputes ZXVL_COMPUTE_TOKEN(stfle_fac[0], ipl_schid) and compares it to boot->binding_token. This ties the running kernel to the specific hardware and IPL device — a protocol struct copied from another machine will fail here.
Stack frame (step 5): The loader writes a two-word canary at boot->kernel_stack_top. The kernel verifies frame[0] == ZXVL_FRAME_MAGIC_A and frame[1] == ZXVL_FRAME_MAGIC_B ^ binding_token. A mismatch indicates stack corruption or an unauthorized loader.
Checksum re-verification (step 6): The kernel re-reads the zxvl_checksum_table_t from kernel_phys_start + ZXVL_CKSUM_TABLE_OFFSET (via HHDM) and recomputes SHA-256 for each PT_LOAD segment. This catches any modification to the kernel image between loader verification and kernel execution.
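The stack-frame check in step 5 can be sketched as follows. This is a minimal illustration, not the kernel's validate_stack_frame(): the magic values are placeholders, the canary words are assumed to be 64 bits wide, and stack_frame_valid is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values: the real ZXVL magics are defined by the loader headers. */
#define ZXVL_FRAME_MAGIC_A 0x1111222233334444ULL
#define ZXVL_FRAME_MAGIC_B 0x5555666677778888ULL

static_assert(ZXVL_FRAME_MAGIC_A != ZXVL_FRAME_MAGIC_B, "magics must differ");

/*
 * frame points at the two-word canary the loader wrote at the stack top.
 * Word 0 is a fixed magic; word 1 is a second magic XORed with the binding
 * token, so a frame copied from another machine fails the comparison.
 */
static bool stack_frame_valid(const uint64_t frame[2], uint64_t binding_token)
{
    return frame[0] == ZXVL_FRAME_MAGIC_A &&
           frame[1] == (ZXVL_FRAME_MAGIC_B ^ binding_token);
}
```

Because the second word depends on the binding token, the check fails both on plain stack corruption and on a frame produced for a different machine or IPL device.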
3. PMM Reservation (Step 11)
pmm_init registers all ZXFL_MEM_USABLE regions from the boot protocol memory map, then marks the following ranges as reserved:
| Range | Reason |
|---|---|
| [0, 1 MB) | Lowcore + loader code |
| [kernel_phys_start, kernel_phys_end) | Kernel image |
| [pool_base, pgtbl_pool_end) | Bootloader page table pool |
| Each module's [phys_start, phys_start + size) | Loaded modules |
4. MMU Initialization Ordering Invariant (Step 12)
mmu_init() takes ownership of the bootloader ASCE and replaces the bootloader's
8 GB identity map with a precise 8 KB window at VA 0. This operation has a strict,
unbreakable ordering requirement rooted in z/Architecture hardware behavior.
Why VA 0 Must Always Be Mapped
Every interrupt handler entry stub (trap_pgm_entry, trap_ext_entry, etc.) begins
with:
lg %r1, LC_ASYNC_STACK(0) // effective VA = 0x0350
The zero base register is not an error — it is the only way to load a value before
registers have been saved. Because DAT is active when this runs, VA 0x350 must be
translated successfully. If the mapping is absent even for one instruction cycle while
interrupts are unmasked, a program-check fires, SAVE_FRAME tries to load from
VA 0x350 again, and the CPU enters an infinite Region-first-translation exception
(0x0039) death loop.
Required Sequence in mmu_init()
Step 1: mmu_map_page(VA 0x0000 → PA 0x0000) // build mapping first
Step 2: mmu_map_page(VA 0x1000 → PA 0x1000) // both pages of the lowcore
Step 3: scrub r1[1..2046] // revoke identity map
Step 4: mmu_flush_tlb_local() // make scrub visible to CPU
Steps 1–2 must precede steps 3–4. The new 8 KB mapping is committed into the
live R1 table before any identity entry is removed, so VA 0x350 is always valid.
Can This Be Avoided by Enabling DAT Earlier?
No. The requirement is not a consequence of when DAT is enabled; it comes from
how SAVE_FRAME accesses the lowcore. Even if ZXFL enabled DAT internally and
passed the kernel a fully virtual address space, the kernel's entry.S would still
execute lg %r1, 0x350(0) and still require VA 0x350 to be mapped. This is
standard z/Architecture operating system design — Linux s390x, z/VM, and z/OS all
maintain an equivalent lowcore window at virtual address 0 for the same reason.
See docs/src/kernel/trap.md for the full architectural rationale.
System Check (syschk)
Document Revision: 26h1.3
Status: Active
1. Overview
The System Check subsystem (syschk) is the kernel's mechanism for halting the system when a condition is detected from which execution cannot safely continue.
The halt path acquires no locks, calls no kernel subsystems, and dereferences no kernel data structures. It is safe to call from any context: exception handlers, IRQ handlers, early init, or a state where kernel memory is corrupt.
2. Error Code Encoding
Every system check is identified by a 16-bit code with three fields:
15 12 11 8 7 0
┌────────┬──────────┬───────────────┐
│ CLASS │ DOMAIN │ TYPE │
│ 4 b │ 4 b │ 8 b │
└────────┴──────────┴───────────────┘
| Field | Bits | Purpose |
|---|---|---|
| CLASS | 15–12 | Severity class |
| DOMAIN | 11–8 | Originating subsystem |
| TYPE | 7–0 | Specific condition within the domain |
2.1 Severity Classes
| Class | Value | Behavior |
|---|---|---|
| FATAL | 0xF | Always halts |
| CRITICAL | 0xC | Always halts |
| WARNING | 0x3 | Always halts |
All classes halt unconditionally. The class field exists for post-mortem triage, not for runtime branching.
2.2 Domains
| Domain | Value | Subsystem |
|---|---|---|
| CORE | 0x0 | Core kernel / initialization |
| MEM | 0x1 | Memory subsystem |
| SYNC | 0x2 | Synchronization primitives |
| ARCH | 0x3 | Architecture / hardware |
| SCHED | 0x4 | Scheduler |
| IO | 0x5 | I/O subsystem |
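The three-field encoding can be expressed with a few shift-and-mask helpers. The sketch below uses the field positions and values from the tables above; the helper and enumerator names are illustrative and may not match the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t zx_syschk_code_t;

/* Values from the class and domain tables; enumerator names are illustrative. */
enum { SYSCHK_CLASS_WARNING = 0x3, SYSCHK_CLASS_CRITICAL = 0xC, SYSCHK_CLASS_FATAL = 0xF };
enum { SYSCHK_DOM_CORE, SYSCHK_DOM_MEM, SYSCHK_DOM_SYNC,
       SYSCHK_DOM_ARCH, SYSCHK_DOM_SCHED, SYSCHK_DOM_IO };

static_assert(sizeof(zx_syschk_code_t) == 2, "codes are 16 bits wide");

/* Pack CLASS (bits 15-12), DOMAIN (bits 11-8), and TYPE (bits 7-0). */
static inline zx_syschk_code_t syschk_make(unsigned cls, unsigned dom, unsigned type)
{
    return (zx_syschk_code_t)(((cls & 0xFu) << 12) | ((dom & 0xFu) << 8) | (type & 0xFFu));
}

/* Field extractors, intended for post-mortem triage tooling. */
static inline unsigned syschk_class(zx_syschk_code_t c)  { return (c >> 12) & 0xFu; }
static inline unsigned syschk_domain(zx_syschk_code_t c) { return (c >> 8) & 0xFu; }
static inline unsigned syschk_type(zx_syschk_code_t c)   { return c & 0xFFu; }
```

For example, a FATAL code in the MEM domain with type 0x02 packs to 0xF102.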
3. Halt Sequence
zx_system_check(code, msg)
│
▼
arch_local_irq_disable()
│
▼
g_halting set? ──YES──► arch_sys_halt()
│
│ NO
▼
g_halting = 1
│
▼
write zx_crash_record_t to lowcore + 0x1400
(magic, code, PSW snapshot, reason string)
│
▼
raw SIGP STOP loop over g_cpu_map[]
(boot protocol array; no percpu_areas lookup)
CC=2 retried; CC=3 skipped
│
▼
arch_sys_halt() ← disabled-wait PSW; machine stops
4. Crash Record
Before halting, the issuing CPU writes a zx_crash_record_t to a fixed
offset (0x1400) within the BSP lowcore. The lowcore sits at a fixed physical
address, is always mapped, and is accessible regardless of kernel heap or DAT state.
Offset Size Field
------ ---- -----
0x00 8 magic (0x5A584352554E4348 "ZXCRUNCH")
0x08 2 code (zx_syschk_code_t)
0x0A 6 pad
0x10 8 psw_mask (EPSW at time of syschk)
0x18 8 psw_addr (0; not available from EPSW)
0x20 128 msg (NUL-terminated reason string)
The record is read post-mortem by a debugger or operator console. It is not printed to the console during the halt sequence.
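The layout above maps directly onto a C structure whose offsets can be checked at compile time. The sketch below is an illustration under assumptions: ZX_CRASH_MAGIC and crash_record_fill are hypothetical names, and the fill helper uses plain byte loops because the halt path is described as calling no kernel subsystems.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZX_CRASH_MAGIC 0x5A584352554E4348ULL /* "ZXCRUNCH" */

typedef struct {
    uint64_t magic;     /* 0x00 */
    uint16_t code;      /* 0x08: zx_syschk_code_t */
    uint8_t  pad[6];    /* 0x0A */
    uint64_t psw_mask;  /* 0x10: EPSW result */
    uint64_t psw_addr;  /* 0x18: always 0, not available from EPSW */
    char     msg[128];  /* 0x20: NUL-terminated reason string */
} zx_crash_record_t;

static_assert(offsetof(zx_crash_record_t, code) == 0x08, "code offset");
static_assert(offsetof(zx_crash_record_t, psw_mask) == 0x10, "psw_mask offset");
static_assert(offsetof(zx_crash_record_t, msg) == 0x20, "msg offset");
static_assert(sizeof(zx_crash_record_t) == 0xA0, "record is 160 bytes");

/* Fill a record without calling any library or kernel routine. */
static void crash_record_fill(zx_crash_record_t *rec, uint16_t code,
                              uint64_t psw_mask, const char *msg)
{
    size_t i = 0;

    rec->magic = ZX_CRASH_MAGIC;
    rec->code = code;
    for (size_t p = 0; p < sizeof rec->pad; p++)
        rec->pad[p] = 0;
    rec->psw_mask = psw_mask;
    rec->psw_addr = 0;
    /* Copy at most 127 characters, then NUL-fill the remainder. */
    for (; i < sizeof rec->msg - 1 && msg[i] != '\0'; i++)
        rec->msg[i] = msg[i];
    for (; i < sizeof rec->msg; i++)
        rec->msg[i] = '\0';
}
```

A post-mortem tool can locate the record by checking for the magic value at lowcore + 0x1400 before trusting the remaining fields.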
5. Re-entrancy
If a second system check fires on any CPU while a halt is already in progress,
the re-entrant call detects g_halting immediately after IRQ disable and
proceeds directly to arch_sys_halt(). The crash record is not overwritten.
g_halting is a volatile int, not an atomic. If the memory subsystem is
corrupt, atomic operations cannot be trusted.
6. SMP Teardown
The halt path iterates g_cpu_map[] — the boot protocol's CPU map, registered
at init time via zx_syschk_register_cpu_map(). This array is loader-written,
physically contiguous, and never freed. It does not depend on percpu_areas[]
or any kernel allocator.
sigp() is a single inline assembly instruction. It acquires no locks.
CC=2 (busy) is retried in a tight loop. CC=3 (not operational) is skipped.
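The teardown loop can be sketched as below. This is a host-testable illustration, not the kernel's code: the SIGP instruction is replaced by a function pointer, g_cpu_map[] is assumed to hold 16-bit CPU addresses, and stop_other_cpus and fake_sigp are hypothetical names. The STOP order code (0x05) and the condition-code meanings follow the z/Architecture definition of SIGNAL PROCESSOR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIGP_STOP 0x05u

enum { SIGP_CC_ACCEPTED = 0, SIGP_CC_STATUS_STORED = 1,
       SIGP_CC_BUSY = 2, SIGP_CC_NOT_OPERATIONAL = 3 };

static_assert(SIGP_CC_BUSY == 2 && SIGP_CC_NOT_OPERATIONAL == 3, "CC values");

/* Stand-in for the inline-assembly sigp(); returns the condition code. */
typedef int (*sigp_fn_t)(uint16_t cpu_addr, unsigned order);

/* Send STOP to every CPU except the caller; returns how many accepted it. */
static size_t stop_other_cpus(const uint16_t *cpu_map, size_t n,
                              uint16_t self_addr, sigp_fn_t sigp)
{
    size_t stopped = 0;

    for (size_t i = 0; i < n; i++) {
        int cc;

        if (cpu_map[i] == self_addr)
            continue;
        do {
            cc = sigp(cpu_map[i], SIGP_STOP);   /* CC=2: retry in place */
        } while (cc == SIGP_CC_BUSY);
        if (cc == SIGP_CC_ACCEPTED)             /* CC=3: skipped */
            stopped++;
    }
    return stopped;
}

/* Host-side fake: CPU 1 is busy twice, CPU 3 is not operational. */
static unsigned fake_busy_left = 2;

static int fake_sigp(uint16_t cpu_addr, unsigned order)
{
    (void)order;
    if (cpu_addr == 3)
        return SIGP_CC_NOT_OPERATIONAL;
    if (cpu_addr == 1 && fake_busy_left > 0) {
        fake_busy_left--;
        return SIGP_CC_BUSY;
    }
    return SIGP_CC_ACCEPTED;
}
```

The loop touches only the CPU map and local variables, which is what lets it run when allocator or per-CPU state can no longer be trusted.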
7. WARNING-Class Codes
WARNING codes halt unconditionally. There is no filter mechanism. If a
subsystem needs to log a recoverable condition, it should call printk
directly and not use zx_system_check.
8. Revision History
| Revision | Change |
|---|---|
| 26h1.3 | Removed filter API; all classes halt unconditionally; crash record written to lowcore; raw SIGP loop; no printk on halt path |
| 26h1.2 | Re-entrant guard moved first; SMP teardown before printk; static BSS message buffer |
| 26h1.1 | Initial release |
Per-CPU Data
Document Revision: 26h1.3
Sources: include/arch/s390x/cpu/lowcore.h, include/zxfoundation/percpu.h,
arch/s390x/cpu/percpu.c
1. Layout
Each CPU's prefix area (lowcore) is a monolithic 8 KB block (two contiguous
physical pages). The physical address of this block is loaded into the hardware
prefix register via SPX. Prefixing then transparently redirects real addresses
0x0000–0x1FFF to the CPU's own lowcore block, and redirects the 8 KB block at the
prefix address to absolute 0x0000–0x1FFF, on every real-address access.
The layout unifies hardware-assigned fields and software-defined per-CPU data into a
single structure (zx_lowcore_t):
Physical Prefix Area (8 KB)
┌──────────────────────────────┐ 0x000
│ Hardware Lowcore │ PSWs, interrupt codes, save areas (PoP §4)
├──────────────────────────────┤ 0x400 ← LC_PERCPU_OFFSET
│ Software Per-CPU Block │ prefix_base, cpu_id, lock_depth,
│ (zx_percpu_t percpu) │ MCS nodes, RCU state, PCP caches
├──────────────────────────────┤ 0x1200
│ Hardware Save Areas │ GPRs, FPRs, CRs, ARs
└──────────────────────────────┘ 0x2000
2. Access — Current CPU
To access the current CPU's own per-CPU data, the kernel uses zx_lowcore(),
which returns the HHDM-mapped pointer to the active lowcore. Because the prefix
register already routes real address 0 to this CPU's physical lowcore, and the
HHDM maps physical 0 to CONFIG_KERNEL_VIRT_OFFSET, zx_lowcore() always resolves
to the correct CPU without needing the prefix register value at all.
| Macro | Description |
|---|---|
| percpu_get(field) | Read a field from the current CPU's percpu block |
| percpu_set(field, val) | Write a field to the current CPU's percpu block |
| percpu_inc(field) | Increment a field in place |
| percpu_dec(field) | Decrement a field in place |
| percpu_ptr_to(field) | Pointer to a field in the current CPU's block |
3. Access — Other CPUs (zx_lowcore_cpu)
3.1 The Hardware Prefix Aliasing Bug
Accessing another CPU's lowcore by index into a global pointer array is deceptively
dangerous on s390x. Consider the global array __percpu_areas_raw[] where:
- __percpu_areas_raw[0] = HHDM pointer to BSP lowcore = CONFIG_KERNEL_VIRT_OFFSET + 0
- __percpu_areas_raw[1] = HHDM pointer to AP-1 lowcore = CONFIG_KERNEL_VIRT_OFFSET + P
When AP-1 (whose prefix register is P) reads a value from address
CONFIG_KERNEL_VIRT_OFFSET + 0 (i.e., the BSP's HHDM lowcore), the MMU translates
it to physical address 0. The prefix register then remaps physical 0 to physical
P — so AP-1 silently reads its own lowcore, not the BSP's.
Symmetrically, when AP-1 reads from CONFIG_KERNEL_VIRT_OFFSET + P, the MMU
translates it to physical P. The prefix register remaps physical P to physical 0
— so AP-1 silently reads the BSP's lowcore.
The result: every AP's cross-CPU lowcore lookup is silently swapped with the BSP's. IPI delivery, RCU quiescent-state tracking, and PMM per-CPU page caches all operated on the wrong CPU's data. The system "mostly worked" because the perfect symmetry of the swap caused IPIs to still reach all CPUs, masking the corruption.
3.2 The Safe Accessor: zx_lowcore_cpu(cpu)
__percpu_areas_raw[] must never be accessed directly. Use zx_lowcore_cpu(cpu)
defined in include/zxfoundation/percpu.h, which applies an inverse prefix swap in
software:
#define zx_lowcore_cpu(cpu) \
({ \
zx_lowcore_t *__lc = __percpu_areas_raw[(cpu)]; \
zx_lowcore_t *__res = __lc; \
if (__lc) { \
uint64_t __target_real = (uint64_t)__lc - CONFIG_KERNEL_VIRT_OFFSET;\
uint64_t __my_prefix = zx_lowcore()->percpu.prefix_base; \
if (__target_real == __my_prefix) \
__res = (zx_lowcore_t *)CONFIG_KERNEL_VIRT_OFFSET; \
else if (__target_real == 0) \
__res = (zx_lowcore_t *)(CONFIG_KERNEL_VIRT_OFFSET + __my_prefix);\
} \
__res; \
})
How it works: if the target's physical address matches my_prefix (the target is the
calling CPU itself), a raw access would be swapped by the hardware to physical 0, so
the accessor redirects to HHDM + 0, which the hardware swaps back to my_prefix. If the
target's physical address is 0 (the BSP), a raw access would be swapped to my_prefix,
so the accessor redirects to HHDM + my_prefix, which the hardware swaps back to
physical 0. Any other CPU is unaffected (no swap applies).
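The swap and its software inverse can be checked on a host with a small model. The sketch below is an assumption-laden illustration: prefix_apply models hardware prefixing for an 8 KB lowcore, HHDM_BASE stands in for CONFIG_KERNEL_VIRT_OFFSET, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LOWCORE_SIZE 0x2000ULL
#define HHDM_BASE    0xFFFF800000000000ULL /* stands in for CONFIG_KERNEL_VIRT_OFFSET */

static_assert((HHDM_BASE & (LOWCORE_SIZE - 1)) == 0, "HHDM base is 8 KB aligned");

/* Model of hardware prefixing: the two 8 KB blocks at 0 and at prefix swap. */
static uint64_t prefix_apply(uint64_t real, uint64_t prefix)
{
    if (real < LOWCORE_SIZE)
        return real + prefix;
    if (real >= prefix && real < prefix + LOWCORE_SIZE)
        return real - prefix;
    return real;
}

/* Same redirect logic as zx_lowcore_cpu(), expressed on plain integers. */
static uint64_t lowcore_cpu_va(uint64_t target_phys, uint64_t my_prefix)
{
    if (target_phys == my_prefix)
        return HHDM_BASE;               /* hardware swaps 0 -> my_prefix */
    if (target_phys == 0)
        return HHDM_BASE + my_prefix;   /* hardware swaps my_prefix -> 0 */
    return HHDM_BASE + target_phys;     /* no swap applies */
}

/* Physical address actually reached when this CPU dereferences va. */
static uint64_t resolve(uint64_t va, uint64_t my_prefix)
{
    return prefix_apply(va - HHDM_BASE, my_prefix);
}
```

With an AP whose prefix is 0x10000, a raw access to HHDM + 0 resolves to 0x10000 (the AP's own lowcore), while the redirected address for target 0 resolves to physical 0 as intended.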
The cross-CPU access macros all go through this accessor:
| Macro | Description |
|---|---|
| percpu_get_on(cpu, field) | Read from another CPU's percpu block |
| percpu_set_on(cpu, field, val) | Write to another CPU's percpu block |
| percpu_ptr_on(cpu, field) | Pointer to a field in another CPU's block |
4. Initialization
| Function | When Called | Effect |
|---|---|---|
| percpu_init_bsp() | Once, early in main.c | Registers BSP lowcore (physical 0x0) in __percpu_areas_raw[0] |
| percpu_init_ap(cpu_id, cpu_addr, node) | Once per AP in smp_init() | Allocates 8 KB (order-1), installs prefix via SPX, registers in __percpu_areas_raw[cpu_id] |
5. Fields (zx_percpu_t)
| Field | Type | Purpose |
|---|---|---|
| prefix_base | uint64_t | Physical address of this CPU's lowcore (used by zx_lowcore_cpu) |
| cpu_id | uint16_t | Logical CPU ID (0 = BSP) |
| cpu_addr | uint16_t | z/Arch CPU address (STAP result); used for SIGP |
| lock_depth | uint32_t | qspinlock nesting depth |
| lock_nodes[MAX_LOCK_DEPTH] | mcs_node_t[] | MCS queue nodes for qspinlock |
| rcu_gp_seq | uint64_t | RCU grace-period sequence (written by BSP) |
| rcu_qs_seq | uint64_t | RCU quiescent-state sequence (written by this CPU) |
| in_rcu_read_side | uint8_t | 1 if inside rcu_read_lock() |
| ipi_pending_count | uint32_t | Pending IPI completion counter |
| ap_stack_top | uint64_t | Initial AP stack pointer (physical, set before SIGP Restart) |
| pcp[ZONE_MAX] | pmm_pcplist_t[] | Per-CPU PMM order-0 page caches, one per memory zone |
6. Assembly Offsets
Key lowcore offsets used by entry.S and head64.S are defined as named constants in
include/arch/s390x/cpu/lowcore.h and verified at compile time by _Static_assert:
| Constant | Value | Field |
|---|---|---|
| LC_ASYNC_STACK | 0x0350 | zx_lowcore_t::async_stack |
| LC_MCCK_STACK | 0x0368 | zx_lowcore_t::mcck_stack |
| LC_KERNEL_STACK | 0x0348 | zx_lowcore_t::kernel_stack |
| LC_RESTART_STACK | 0x0360 | zx_lowcore_t::restart_stack |
| LC_KERNEL_ASCE | 0x0388 | zx_lowcore_t::kernel_asce |
| LC_PERCPU_OFFSET | 0x0400 | zx_lowcore_t::percpu |
| LC_CPU_ID_OFFSET | 0x0408 | zx_percpu_t::cpu_id (within percpu block) |
Interrupt Subsystem
Document Revision: 26h1.0
Subsystem: arch/s390x/trap, zxfoundation/irq
1. Overview
The interrupt subsystem handles all four z/Architecture interrupt classes delivered to the kernel: program check, external, I/O, and machine check. It is structured in two layers:
- Architecture layer (arch/s390x/trap/): low-level entry stubs and class-specific C handlers that decode hardware state from the lowcore.
- Generic layer (zxfoundation/irq/): a flat IRQ descriptor table that routes decoded interrupt codes to registered handlers.
Supervisor calls (SVC) are reserved for the future syscall layer and are not dispatched through this subsystem.
2. Interrupt Delivery on z/Architecture
When an interrupt fires, the hardware atomically:
- Saves the current PSW into the class-specific old PSW slot in the lowcore (prefix area).
- Writes interrupt parameters into fixed lowcore fields.
- Loads the class-specific new PSW slot, transferring control to the kernel entry stub.
Hardware fires interrupt
│
▼
Save current PSW → lowcore old PSW slot (0x0130/0x0150/0x0160/0x0170)
│
▼
Write interrupt parameters to lowcore (pgm_code, ext_int_code, …)
│
▼
Load new PSW slot (0x01B0/0x01D0/0x01E0/0x01F0) → entry stub
The new PSW slots are installed by zx_lowcore_setup_late() after DAT is
enabled. Before that point they hold disabled-wait sentinels.
3. Lowcore Interrupt Slots
| Class | Old PSW | New PSW | Parameter fields |
|---|---|---|---|
| External | 0x0130 | 0x01B0 | ext_int_code (0x0086) |
| Program check | 0x0150 | 0x01D0 | pgm_code (0x008E) |
| Machine check | 0x0160 | 0x01E0 | mcck_interruption_code (0x00E8) |
| I/O | 0x0170 | 0x01F0 | subchannel_nr (0x00BA) |
4. Entry Stubs (arch/s390x/trap/entry.S)
Each entry stub performs the following sequence without touching any kernel data structure:
entry stub
│
├─ Load dedicated stack pointer from lowcore
│ async_stack (0x0350) for PGM / EXT / IO
│ mcck_stack (0x0368) for MCCK
│
├─ Allocate 160-byte ABI save area + 160-byte interrupt frame
│
├─ Store GPRs r0–r15 into frame.gprs[0..15]
│
├─ Copy old PSW (mask + addr) from lowcore into frame.psw_mask/psw_addr
│
├─ Set %r2 = &frame (first argument to C handler)
│
├─ BRASL → C handler (do_pgm_check / do_ext_interrupt / …)
│
└─ Restore GPRs r0–r14, LPSWE from frame.psw_mask
The machine-check stub uses a separate stack (mcck_stack) so that the
handler runs even if the async stack is corrupt.
4.1 Interrupt Frame Layout
Offset Size Field
------ ---- -----
0x00 128 gprs[0..15] — GPRs at interrupt time
0x80 8 psw_mask — old PSW mask word
0x88 8 psw_addr — old PSW instruction address
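Total: 160 bytes (IRQ_FRAME_SIZE). The fields listed above occupy 0x90 (144) bytes; the remaining 16 bytes are not described in this revision.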
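The frame layout can be written as a C structure with compile-time offset checks. In the sketch below, the trailing reserved field is an assumption introduced only to reach the stated 160-byte total (the listed fields end at offset 0x90); the real irq_frame_t may use that space differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IRQ_FRAME_SIZE 160

typedef struct {
    uint64_t gprs[16];     /* 0x00: GPRs at interrupt time */
    uint64_t psw_mask;     /* 0x80: old PSW mask word */
    uint64_t psw_addr;     /* 0x88: old PSW instruction address */
    uint64_t reserved[2];  /* 0x90: assumed padding up to IRQ_FRAME_SIZE */
} irq_frame_t;

static_assert(offsetof(irq_frame_t, psw_mask) == 0x80, "psw_mask offset");
static_assert(offsetof(irq_frame_t, psw_addr) == 0x88, "psw_addr offset");
static_assert(sizeof(irq_frame_t) == IRQ_FRAME_SIZE, "frame size");
```

Keeping these assertions next to the structure means a change to the C layout fails the build instead of silently diverging from the offsets used in entry.S.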
5. IRQ Number Space
The generic layer uses a 16-bit IRQ number partitioned by interrupt class:
0x0000 – 0x00FF Program check codes (pgm_code & 0x7FFF)
0x0100 – 0x01FF External codes (ext_int_code)
0x0200 – 0x02FF I/O subchannel numbers (subchannel_nr & 0xFF)
0x0300 – 0x03FF Machine-check sub-codes (mcic >> 56)
The descriptor table has ZX_IRQ_NR_MAX = 0x400 entries.
6. IRQ Descriptor Table (zxfoundation/irq/)
The table is a flat, statically-allocated BSS array. Each entry holds:
- A handler function pointer (irq_handler_t).
- An opaque data pointer forwarded to the handler.
- flags (ZX_IRQF_SHARED, ZX_IRQF_DISABLED).
- A count field incremented on every dispatch.
6.1 Dispatch Path
C handler (do_pgm_check / do_ext_interrupt / …)
│
├─ Read hardware code from lowcore
├─ Compute irq = ZX_IRQ_BASE_* + code
└─ irq_dispatch(irq, frame)
│
├─ Bounds check irq < ZX_IRQ_NR_MAX
├─ Increment desc->count
└─ Call desc->handler (or default handler if NULL)
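The dispatch path above can be sketched in a few lines of C. This is a simplified illustration: the handler signature, the field types, the int return value of irq_dispatch, and the counting default handler are all assumptions, and flag handling (ZX_IRQF_SHARED, ZX_IRQF_DISABLED) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZX_IRQ_NR_MAX 0x400u

typedef struct irq_frame irq_frame_t;   /* opaque in this sketch */
typedef void (*irq_handler_t)(unsigned irq, irq_frame_t *frame, void *data);

typedef struct {
    irq_handler_t handler;
    void         *data;
    uint32_t      flags;
    uint64_t      count;
} irq_desc_t;

static_assert(ZX_IRQ_NR_MAX == 4 * 0x100, "four classes of 256 IRQs each");

static irq_desc_t irq_table[ZX_IRQ_NR_MAX];  /* flat, zero-initialized */
static unsigned default_hits;                /* stands in for printk/syschk */

static void irq_default_handler(unsigned irq, irq_frame_t *frame, void *data)
{
    (void)irq; (void)frame; (void)data;
    default_hits++;
}

/* Returns 0 on success, -1 on a bad IRQ number, NULL handler, or busy slot. */
static int irq_register(unsigned irq, irq_handler_t handler, void *data, uint32_t flags)
{
    if (irq >= ZX_IRQ_NR_MAX || handler == NULL || irq_table[irq].handler != NULL)
        return -1;
    irq_table[irq].handler = handler;
    irq_table[irq].data = data;
    irq_table[irq].flags = flags;
    return 0;
}

/* Bounds check, count, then call the registered or default handler. */
static int irq_dispatch(unsigned irq, irq_frame_t *frame)
{
    irq_desc_t *desc;

    if (irq >= ZX_IRQ_NR_MAX)
        return -1;
    desc = &irq_table[irq];
    desc->count++;
    if (desc->handler)
        desc->handler(irq, frame, desc->data);
    else
        irq_default_handler(irq, frame, NULL);
    return 0;
}

/* Example handler: bumps the counter passed through the data pointer. */
static void count_handler(unsigned irq, irq_frame_t *frame, void *data)
{
    (void)irq; (void)frame;
    (*(unsigned *)data)++;
}
```

Registering count_handler on one IRQ number and dispatching it once increments both the caller's counter and desc->count, while an unregistered number falls through to the default handler.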
6.2 Default Handler Behavior
| IRQ range | Default action |
|---|---|
| PGM (0x0–0xFF) | zx_system_check(ARCH_UNHANDLED_TRAP) — fatal |
| EXT (0x100–0x1FF) | printk + drop |
| IO (0x200–0x2FF) | printk + drop |
| MCCK (0x300–0x3FF) | zx_system_check(ARCH_MCHECK) — fatal |
7. Machine-Check Special Case
Before dispatching, do_mcck_interrupt checks the system damage bit
(bit 0) of the MCIC. If set, zx_system_check() is called immediately —
the descriptor table itself may reside in damaged storage and cannot be
trusted.
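Because z/Architecture numbers bits from the most significant end, bit 0 of the 64-bit MCIC is the top bit of the doubleword. A minimal sketch of the pre-dispatch test and the sub-code extraction from the IRQ number space follows; the macro and function names are hypothetical, and ZX_IRQ_BASE_MCCK is assumed to equal the 0x0300 base shown earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCIC_SYSTEM_DAMAGE (1ULL << 63)  /* IBM bit 0 = most significant bit */
#define ZX_IRQ_BASE_MCCK   0x0300u       /* assumed from the IRQ number space */

static_assert(MCIC_SYSTEM_DAMAGE == 0x8000000000000000ULL, "bit 0 is the MSB");

/* True when the machine check reports system damage. */
static inline bool mcic_system_damage(uint64_t mcic)
{
    return (mcic & MCIC_SYSTEM_DAMAGE) != 0;
}

/* IRQ number for a machine check: base plus the top byte of the MCIC. */
static inline unsigned mcck_irq(uint64_t mcic)
{
    return ZX_IRQ_BASE_MCCK + (unsigned)(mcic >> 56);
}
```

A handler would test mcic_system_damage() first and only compute mcck_irq() for the descriptor lookup when the bit is clear.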
8. Registration API
irq_register(irq, handler, data, flags) → 0 or -1
irq_unregister(irq)
irq_dispatch(irq, frame)
irq_get_desc(irq) → const irq_desc_t *
irq_register and irq_unregister are not SMP-safe at this revision.
They must be called during single-threaded initialization or with external
serialization.
9. Revision History
| Revision | Change |
|---|---|
| 26h1.0 | Initial release |
Memory Management
Document Revision: 26h1.0
ZXFoundation™'s memory management is organized in five layers:
┌──────────────────────────────────────────┐
│ kmalloc / kfree (general-purpose) │
├──────────────────────────────────────────┤
│ Slab allocator (fixed-size caches) │
├──────────────────────────────────────────┤
│ VMM (virtual address space)│
├──────────────────────────────────────────┤
│ PMM (physical frames) │
├──────────────────────────────────────────┤
│ MMU (hardware DAT tables) │
└──────────────────────────────────────────┘
| Page | Contents |
|---|---|
| PMM | Zone-aware buddy allocator, page descriptors |
| VMM | Virtual address space, VMA red-black tree, vmalloc |
| Slab & Kmalloc | Fixed-size object caches, general allocator |
Physical Memory Manager (PMM)
Document Revision: 26h1.0
Source: zxfoundation/memory/pmm.c
1. Zones
| Zone | Physical range | Purpose |
|---|---|---|
| ZONE_DMA | [0, 16 MB) | Channel I/O buffers (31-bit CDA constraint) |
| ZONE_NORMAL | [16 MB, RAM limit) | General kernel allocations |
Allocations without ZX_GFP_DMA are served from ZONE_NORMAL first. If ZONE_NORMAL is exhausted and ZX_GFP_DMA_FALLBACK is set, the PMM falls back to ZONE_DMA.
2. Buddy Allocator
Free physical frames are managed in a buddy system. Block sizes are powers of two, from order 0 (4 KB) to order 10 (4 MB). Each order has a free list of blocks.
Allocation — walk the free list at the requested order. If empty, split a block from the next higher order. Repeat until a block is found or all orders are exhausted.
Deallocation — compute the buddy PFN (pfn ^ (1 << order)). If the buddy is free at the same order, coalesce and recurse upward.
Free list links use PFN-based intrusive fields (buddy_next) rather than virtual pointers, ensuring correctness across HHDM translations.
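The buddy arithmetic described above reduces to a few bit operations on the PFN. The sketch below assumes 4 KB pages; buddy_pfn, merged_pfn, and order_bytes are illustrative helper names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT    12u   /* 4 KB pages */
#define PMM_MAX_ORDER 10u   /* largest block: 4 MB */

static_assert((1ULL << (PAGE_SHIFT + PMM_MAX_ORDER)) == 4ULL * 1024 * 1024,
              "order 10 is 4 MB");

/* PFN of the buddy block at the same order: flip the order bit. */
static inline uint64_t buddy_pfn(uint64_t pfn, unsigned order)
{
    return pfn ^ (1ULL << order);
}

/* PFN of the block formed by coalescing a block with its buddy. */
static inline uint64_t merged_pfn(uint64_t pfn, unsigned order)
{
    return pfn & ~(1ULL << order);
}

/* Size in bytes of a block at the given order. */
static inline uint64_t order_bytes(unsigned order)
{
    return 1ULL << (PAGE_SHIFT + order);
}
```

For example, the order-2 block at PFN 12 has its buddy at PFN 8, and the two coalesce into an order-3 block starting at PFN 8.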
3. Page Descriptor (zx_page_t)
Each physical frame has a 32-byte descriptor. The descriptor array is mapped contiguously in the HHDM. 32 bytes places 128 descriptors per 4 KB frame — a deliberate cache-line optimization.
| Field | Description |
|---|---|
| refcount | Atomic reference count; 0 = free |
| order | Current buddy order of this block |
| flags | Zone membership, compound page markers |
| buddy_next | PFN of next free block in the buddy list |
4. GFP Flags
| Flag | Meaning |
|---|---|
| ZX_GFP_NORMAL | Standard allocation from ZONE_NORMAL |
| ZX_GFP_DMA | Must allocate from ZONE_DMA |
| ZX_GFP_DMA_FALLBACK | Try ZONE_NORMAL, fall back to ZONE_DMA |
| ZX_GFP_ZERO | Zero-fill the allocated pages |
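The zone selection implied by these flags can be sketched as a function that returns the zones to try, in order. The numeric flag values and the pmm_zone_order helper are assumptions made for illustration; only the flag names and the fallback rule come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values are placeholders; only the names appear in the PMM interface. */
#define ZX_GFP_NORMAL       0x0u
#define ZX_GFP_DMA          0x1u
#define ZX_GFP_DMA_FALLBACK 0x2u
#define ZX_GFP_ZERO         0x4u

enum { ZONE_DMA = 0, ZONE_NORMAL = 1, ZONE_MAX = 2 };

static_assert(ZONE_MAX == 2, "two zones in this revision");

/*
 * Fill zones[] with the zones to try, in order, and return how many.
 * ZX_GFP_DMA pins the request to ZONE_DMA; otherwise ZONE_NORMAL is tried
 * first and ZONE_DMA is added only when ZX_GFP_DMA_FALLBACK is set.
 */
static int pmm_zone_order(uint32_t gfp, int zones[ZONE_MAX])
{
    if (gfp & ZX_GFP_DMA) {
        zones[0] = ZONE_DMA;
        return 1;
    }
    zones[0] = ZONE_NORMAL;
    if (gfp & ZX_GFP_DMA_FALLBACK) {
        zones[1] = ZONE_DMA;
        return 2;
    }
    return 1;
}
```

ZX_GFP_ZERO does not influence zone choice; it only affects how the pages are prepared after a zone has satisfied the request.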
5. SMP Safety & Per-CPU Lists (PCP)
Each zone has a dedicated ticket spinlock. To reduce contention, order-0 pages are cached in Per-CPU Lists (PCP).
- Allocation: CPUs pull from local PCP first without locking (IRQs disabled).
- Drain: Global operations (like pmm_reserve_range) trigger a global PCP drain via SIGP Emergency Signals (IPI) to all other CPUs. This ensures no CPU holds a stale cached page that should be reserved.
6. HHDM Side Reinforcement
The Direct Physical Mapping (HHDM) is validated during initialization:
- Validation: pmm_verify_hhdm() checks translation consistency against the loader's memory map. It verifies that every usable physical page is correctly mapped to its HHDM virtual counterpart.
- EDAT Compliance: Verifies Enhanced-DAT (EDAT-1/2) 1 MB and 2 GB page usage to optimize memory performance and reduce TLB pressure.
- Consistency: The loader must ensure that the mapping covers the entire physical memory range described in the boot protocol, rounding up to the nearest Region-3 or Segment boundary as required by the z/Architecture DAT structure.
7. Initialization
pmm_init(boot) is called once during early init:
- Walk boot->mem_map[] and register all ZXFL_MEM_USABLE regions.
- Mark reserved ranges via Surgical Reservation:
  - Lowcore/Artifacts: [0, 1 MB) is always reserved to protect lowcore and loader leftovers.
  - Kernel Image: [kernel_phys_start, kernel_phys_end) is marked as critical.
  - Page Table Pool: [kernel_phys_end, pgtbl_pool_end) is reserved to protect active DAT tables.
  - PMM Metadata: The zx_mem_map descriptor array itself.
- Insert all non-reserved USABLE frames into the buddy free lists.
[!IMPORTANT] Surgical Reservation prevents "Zone Exhaustion" bugs where a large bootloader page pool could otherwise wipe out all available frames in ZONE_DMA (under 16 MB).
Virtual Memory Manager (VMM)
Document Revision: 26h1.0
Source: zxfoundation/memory/vmm.c
1. Address Space Regions
| Region | Base | Purpose |
|---|---|---|
| HHDM | 0xFFFF800000000000 | Linear physical memory map (built by loader, read-only to VMM) |
| vmalloc | 0xFFFFC00000000000 | Dynamically mapped kernel memory |
2. Virtual Memory Areas (VMAs)
Each allocated virtual range is described by a vm_area_t:
| Field | Description |
|---|---|
| va_start | Start of virtual range (page-aligned) |
| va_end | End of virtual range (exclusive) |
| flags | VM_READ, VM_WRITE, VM_EXEC |
| rb_node | Red-Black Tree node for $O(\log n)$ lookup |
VMAs are indexed in a Red-Black Tree (rbtree.h). A one-entry MRU cache in vm_space_t provides an $O(1)$ fast path for sequential access patterns.
3. vmalloc
vmm_alloc(size, flags) reserves a contiguous virtual range in the vmalloc region and maps it with PMM-allocated frames:
vmm_alloc(size, flags)
│
├─ Round size up to page boundary
├─ Bump-allocate virtual range from vmalloc region
├─ Insert VMA into red-black tree
├─ For each page in range:
│ ├─ pmm_alloc_page(flags)
│ └─ mmu_map_page(kernel_pgtbl, va, pa, prot)
└─ Return va_start
Frames backing a vmalloc range are not required to be physically contiguous.
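The first two steps of the flow above (page rounding and bump allocation of the virtual range) can be sketched as follows. This is a minimal illustration, not the code in vmm.c: VMALLOC_LIMIT, vmalloc_cursor, and vmalloc_reserve are hypothetical names, and the upper bound of the region is an assumption since the guide only gives its base.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE     4096ULL
#define VMALLOC_BASE  0xFFFFC00000000000ULL  /* base from the region table */
#define VMALLOC_LIMIT 0xFFFFE00000000000ULL  /* hypothetical upper bound */

/* Round a byte count up to the next multiple of PAGE_SIZE.
 * Wraps to 0 for sizes within one page of UINT64_MAX. */
static inline uint64_t page_align_up(uint64_t size)
{
    return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

/* Next unused virtual address in the vmalloc region (hypothetical state). */
static uint64_t vmalloc_cursor = VMALLOC_BASE;

/* Reserve a page-aligned virtual range.
 * Returns 0 for a zero-length request or when the region is exhausted. */
static uint64_t vmalloc_reserve(uint64_t size)
{
    uint64_t len = page_align_up(size);
    if (len == 0 || len > VMALLOC_LIMIT - vmalloc_cursor)
        return 0;
    uint64_t va = vmalloc_cursor;
    vmalloc_cursor += len;
    return va;
}
```

A bump cursor never reuses freed ranges, which is acceptable only while the region is far larger than the total ever allocated; the gap search described in the Red-Black Tree chapter is the kind of structure that would replace it.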
4. Large-Object Heap (kheap)
For allocations larger than 8 KB, kheap_alloc calls vmm_alloc to back the range with PMM frames. A 64-bit HEAP_MAGIC canary guards the allocation header against buffer underflows.
5. MMU Integration
The VMM calls mmu_map_page (4 KB), mmu_map_large_page (1 MB, if EDAT-1 available), or mmu_map_huge_page (2 GB, if EDAT-2 available) to install PTEs. TLB coherency is handled automatically by the IPTE instruction — no software IPI is required.
Slab Allocator & kmalloc
Document Revision: 26h1.1
Source: zxfoundation/memory/slab.c, zxfoundation/memory/kmalloc.c
1. Slab Allocator
The slab allocator provides fixed-size object caches to amortize the cost of frequent small allocations (VMAs, sync primitives, capability tables, etc.). It uses a magazine-depot architecture for lock-free per-CPU fast paths and SMP-safe bulk operations through the depot.
1.1 Architecture
kmem_cache_t
├─ obj_size (8-byte aligned)
├─ storage_key (s390x storage key for all backing pages)
├─ depot_lock (spinlock protecting the depot lists)
├─ full_mags (depot: magazines with MAG_SIZE objects ready)
├─ empty_mags (depot: magazines ready to be refilled)
├─ partial_slabs (slab pages with free objects remaining)
├─ full_slabs (slab pages fully allocated)
└─ cpu_mags[MAX_CPUS] (per-CPU active magazine pointer)
Each magazine holds up to MAG_SIZE (31) object pointers.
Each slab is one PMM page; the slab header, free-index stack, and object array are all embedded within that page.
1.2 Fast Path (per-CPU, no lock)
alloc:
  IRQs disabled
  if cpu_mag.count > 0 → pop and return
  else → magazine_swap(fill) → pop and return
free:
  IRQs disabled
  if cpu_mag.count < MAG_SIZE → push and return
  else → magazine_swap(drain) → push and return
IRQs are disabled for the duration of the fast path. No lock is taken; the per-CPU magazine is accessed exclusively.
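The per-CPU fast path amounts to a push or pop on a small pointer stack. The sketch below assumes a magazine layout (a count plus a fixed array) that the guide does not spell out; magazine_t, mag_pop, and mag_push are illustrative names, and IRQ masking is left to the caller.

```c
#include <stddef.h>
#include <stdbool.h>

#define MAG_SIZE 31   /* documented magazine capacity */

/* A magazine is a small LIFO stack of free object pointers. */
typedef struct magazine {
    unsigned count;
    void    *objs[MAG_SIZE];
} magazine_t;

/* Fast-path alloc: pop one object, or NULL if the magazine is empty
 * (the caller would then take the slow path, magazine_swap(fill)). */
static void *mag_pop(magazine_t *m)
{
    return m->count > 0 ? m->objs[--m->count] : NULL;
}

/* Fast-path free: push one object; false means the magazine is full
 * (the caller would then take the slow path, magazine_swap(drain)). */
static bool mag_push(magazine_t *m, void *obj)
{
    if (m->count >= MAG_SIZE)
        return false;
    m->objs[m->count++] = obj;
    return true;
}
```

LIFO order means the most recently freed object is handed out first, which tends to keep it cache-warm.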
1.3 Slow Path (depot, with lock)
magazine_swap acquires depot_lock. Two sub-paths:
Fill (need objects):
1. full_mags non-empty?
yes → promote to CPU slot immediately (fast fill)
no → obtain empty shell from empty_mags (or alloc from mag_cache)
→ cache_refill_magazine (may drop+reacquire depot_lock for PMM)
→ move filled shell to full_mags → promote to CPU slot
Drain (returning a full CPU magazine):
1. Push CPU magazine to full_mags
2. Pull empty shell from empty_mags into CPU slot (or set to nullptr)
1.4 Slab Refill & Lock Discipline
cache_refill_magazine is called with depot_lock held.
When a new slab page must be allocated from the PMM:
drop depot_lock
pmm_alloc_page() ← PMM zone lock acquired/released here
reacquire depot_lock
re-validate partial_slabs (another CPU may have added one in the window)
This ensures the PMM zone lock and depot_lock are never held simultaneously, eliminating the lock-inversion hazard present in earlier revisions.
1.5 Node Lifecycle
Magazine nodes cycle between:
empty_mags ──fill──▶ (detached, being filled) ──▶ full_mags ──promote──▶ cpu_mag
cpu_mag ──drain──▶ full_mags empty_mags ◀── (pulled empty shell)
list_del_init is used for all magazine-node removals so nodes are always in a self-pointing state when not on a list, making re-insertion safe without re-initialization.
2. kmalloc
kmalloc(size) routes requests to the appropriate slab cache based on size class.
| Size range | Backing |
|---|---|
| ≤ 8 KB | Slab cache (power-of-two class) |
| > 8 KB | vmalloc → vmm_alloc |
kfree(ptr) returns the object to its originating cache.
A header embedded before each allocation records the cache pointer and a canary for use-after-free detection.
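As an illustration of the size-class routing, the sketch below maps a request to the smallest power-of-two class that fits it. The 8-byte minimum class and the function name are assumptions, and the embedded allocation header is ignored for simplicity.

```c
#include <stddef.h>

#define KMALLOC_MIN_SIZE 8u      /* hypothetical smallest class */
#define KMALLOC_MAX_SIZE 8192u   /* 8 KB slab/vmalloc threshold */

/* Return the power-of-two class size for a request, or 0 if the request
 * exceeds the slab threshold and must fall through to the vmalloc path. */
static size_t kmalloc_class_size(size_t size)
{
    if (size > KMALLOC_MAX_SIZE)
        return 0;
    size_t cls = KMALLOC_MIN_SIZE;
    while (cls < size)
        cls <<= 1;
    return cls;
}
```

Power-of-two classes bound internal fragmentation to just under 50% of the class size in the worst case (a request one byte above the previous class).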
3. Initialization Order
pmm_init() ← must run first; slab needs PMM pages
slab_init() ← bootstraps cache_cache and mag_cache from a single PMM page
kmalloc_init() ← registers size-class caches via kmem_cache_create
vmm_notify_slab_ready() ← switches VMM early allocator to kmalloc
4. Strict Requirements
| ID | Requirement |
|---|---|
| SLAB-1 | kmem_cache_alloc must not be called from hard-IRQ context unless the cache was created with atomic support. Use kmalloc(ZX_GFP_ATOMIC) from IRQ context. |
| SLAB-2 | kmem_cache_free must only be called with a pointer returned by kmem_cache_alloc on the same cache. Cross-cache free is undefined behavior. |
| SLAB-3 | kmem_cache_destroy must only be called after all objects have been returned. Outstanding objects at destroy time trigger a kernel panic. |
| SLAB-4 | depot_lock must never be held when calling into the PMM or any allocator that may itself acquire a zone lock. Use the lock-drop protocol in cache_refill_magazine. |
SMP
Document Revision: 26h1.0
Source: arch/s390x/cpu/
1. CPU Detection
The bootloader detects CPUs by issuing SIGP Sense (order 0x01) to each address in [0, ZXFL_CPU_MAP_MAX). A condition code of 3 means "not operational" — the address is unoccupied. CC 0, 1, or 2 means the CPU exists and is recorded in proto->cpu_map[].
The BSP address is read with STAP (Store CPU Address).
At kernel entry, proto->cpu_count contains the number of detected CPUs and proto->bsp_cpu_addr identifies the boot processor.
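The detection loop can be modelled on a host with the SIGP instruction replaced by a function pointer. Everything here is a sketch: the bound of 64 for ZXFL_CPU_MAP_MAX, the sigp_fn signature, and fake_sigp are assumptions made so the logic can run without hardware.

```c
#include <stdint.h>

#define ZXFL_CPU_MAP_MAX        64u    /* hypothetical bound for this sketch */
#define SIGP_SENSE              0x01u
#define SIGP_CC_NOT_OPERATIONAL 3

/* Signature of a SIGP wrapper returning the condition code (0..3). */
typedef int (*sigp_fn)(uint16_t cpu_addr, uint8_t order);

/* Probe every CPU address and record the ones that respond. */
static unsigned detect_cpus(sigp_fn sigp, uint16_t *cpu_map, unsigned cap)
{
    unsigned count = 0;
    for (uint16_t addr = 0; addr < ZXFL_CPU_MAP_MAX && count < cap; addr++) {
        if (sigp(addr, SIGP_SENSE) != SIGP_CC_NOT_OPERATIONAL)
            cpu_map[count++] = addr;
    }
    return count;
}

/* Stand-in for the real instruction: pretends CPUs 0, 1 and 3 exist. */
static int fake_sigp(uint16_t cpu_addr, uint8_t order)
{
    (void)order;
    return (cpu_addr == 0 || cpu_addr == 1 || cpu_addr == 3) ? 0 : 3;
}
```

Calling detect_cpus(fake_sigp, map, 8) fills map with the three responding addresses, showing that CPU addresses need not be contiguous.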
2. AP State at Handover
All APs are in the stopped state when the kernel receives control. The bootloader never starts APs. The kernel BSP is responsible for starting each AP:
| Step | Action |
|---|---|
| 1 | Allocate a private prefix area for the AP (8 KB, aligned on an 8 KB boundary, as z/Architecture requires) |
| 2 | Allocate a private stack for the AP |
| 3 | Install interrupt new PSWs in the AP's prefix area |
| 4 | SIGP Initial CPU Reset — clear the AP's state |
| 5 | SIGP Set Prefix — point the AP's prefix register at its private lowcore |
| 6 | SIGP Restart — start the AP at the restart new PSW in its prefix area |
Note: AP startup is not yet implemented. The current kernel halts after BSP initialization.
3. Per-CPU Data
Each CPU requires its own:
- Prefix area (8 KB in z/Architecture mode): private lowcore with correct new PSWs. Set via SPX.
- Stack: the AP must not use the BSP stack or the loader stack.
- Per-CPU variables: accessed via the prefix register offset (analogous to %gs on x86).
4. TLB Coherency
z/Architecture hardware handles TLB coherency automatically via the IPTE (Invalidate Page Table Entry) instruction. IPTE atomically clears a PTE and broadcasts a TLB purge to all CPUs that have the affected ASCE loaded. No software IPI is required for TLB shootdowns.
mmu_ipte(va):
ipte %r0, va ← serialising, hardware-broadcast
PTLB (Purge TLB) flushes the entire local TLB and should only be used during address-space teardown. For single-page invalidation in a running SMP kernel, always use IPTE.
5. SIGP Reference
| Order | Code | Use |
|---|---|---|
| Sense | 0x01 | Query CPU state |
| External Call | 0x02 | Send external interrupt to CPU |
| Emergency Signal | 0x03 | Send emergency signal |
| Initial CPU Reset | 0x0B | Clear CPU state before restart |
| Set Prefix | 0x0D | Set prefix register on target CPU |
| Store Status at Address | 0x0E | Save CPU registers at a caller-specified address |
| Set Architecture | 0x12 | Switch to z/Architecture mode |
| Restart | 0x06 | Start AP at the restart new PSW in its prefix area |
PSW Manager
Document Revision: 26h1.0
Subsystem: arch/s390x/cpu/psw
1. Overview
The PSW (Program Status Word) manager provides a single, authoritative
definition of all z/Architecture PSW mask constants and new-PSW lowcore
offsets. Prior to this subsystem, constants were duplicated across
zxconfig.h and lowcore.h under different names, and assembly files
hardcoded incorrect bit patterns.
All consumers — C translation units, assembly files, the ZXFL loader, and
the kernel — include a single header: arch/s390x/cpu/psw.h.
2. PSW Mask Word Layout
The z/Architecture PSW is 16 bytes. The first 8 bytes are the mask word; the second 8 bytes are the instruction address.
Bit 1 PER mask
Bit 5 DAT (address translation enable)
Bit 6 I/O interrupt mask
Bit 7 External interrupt mask
Bit 13 Machine-check mask
Bit 14 Wait state
Bit 15 Problem state (user mode)
Bits 16-17 Address space control (ASC)
Bit 31 EA (required for 64-bit addressing)
Bit 32 BA (required for 64-bit addressing)
Bits are numbered from the most-significant end, so bit 0 corresponds to 0x8000000000000000. Bits 8-11 (PSW key), 18-19 (condition code), and 20-23 (program mask) are not listed because none of the masks defined below set them. All remaining bits are reserved and must be zero; setting a reserved bit causes a Specification Exception when the PSW is loaded via LPSWE.
3. Defined Constants
3.1 Bit Masks
| Constant | Value | Description |
|---|---|---|
| PSW_BIT_DAT | 0x0400000000000000 | Address translation enable |
| PSW_BIT_IO | 0x0200000000000000 | I/O interrupt mask |
| PSW_BIT_EXT | 0x0100000000000000 | External interrupt mask |
| PSW_BIT_MCCK | 0x0004000000000000 | Machine-check mask |
| PSW_BIT_WAIT | 0x0002000000000000 | Wait state |
| PSW_BIT_PSTATE | 0x0001000000000000 | Problem state (user mode) |
| PSW_BIT_HOME_SPACE | 0x0000C00000000000 | Home space addressing mode |
| PSW_BIT_EA | 0x0000000100000000 | Extended addressing (64-bit) |
| PSW_BIT_BA | 0x0000000080000000 | Basic addressing (64-bit) |
3.2 Composite Masks
| Constant | Value | Description |
|---|---|---|
| PSW_ARCH_BITS | 0x0000000180000000 | EA \| BA: 64-bit mode, no other bits set |
| PSW_MASK_KERNEL | 0x0000000180000000 | Supervisor, DAT off, all interrupts disabled |
| PSW_MASK_KERNEL_DAT | 0x0400C00180000000 | Supervisor, DAT on (Home Space), all interrupts disabled |
| PSW_MASK_DISABLED_WAIT | 0x0002000180000000 | Wait state, DAT off, all interrupts disabled |
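The composite values can be checked mechanically by deriving each mask from its IBM bit number. The PSW_BIT(n) helper below is a hypothetical macro introduced for this sketch (the guide does not say how psw.h spells its constants); the constant names and expected values are the ones from the tables.

```c
#include <stdint.h>

/* IBM numbers bits from the most-significant end: bit 0 is 1 << 63. */
#define PSW_BIT(n) (1ULL << (63 - (n)))

#define PSW_BIT_DAT        PSW_BIT(5)
#define PSW_BIT_IO         PSW_BIT(6)
#define PSW_BIT_EXT        PSW_BIT(7)
#define PSW_BIT_WAIT       PSW_BIT(14)
#define PSW_BIT_PSTATE     PSW_BIT(15)
#define PSW_BIT_HOME_SPACE (PSW_BIT(16) | PSW_BIT(17))
#define PSW_BIT_EA         PSW_BIT(31)
#define PSW_BIT_BA         PSW_BIT(32)

#define PSW_ARCH_BITS          (PSW_BIT_EA | PSW_BIT_BA)
#define PSW_MASK_KERNEL        PSW_ARCH_BITS
#define PSW_MASK_KERNEL_DAT    (PSW_ARCH_BITS | PSW_BIT_DAT | PSW_BIT_HOME_SPACE)
#define PSW_MASK_DISABLED_WAIT (PSW_ARCH_BITS | PSW_BIT_WAIT)

/* Compile-time checks against the documented hexadecimal values. */
_Static_assert(PSW_BIT_DAT == 0x0400000000000000ULL, "DAT is bit 5");
_Static_assert(PSW_MASK_KERNEL_DAT == 0x0400C00180000000ULL, "kernel DAT mask");
_Static_assert(PSW_MASK_DISABLED_WAIT == 0x0002000180000000ULL, "disabled wait");
```

Deriving masks from bit numbers keeps the header aligned with the Principles of Operation numbering and makes a mistyped hexadecimal literal fail at compile time.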
3.3 New PSW Lowcore Offsets
These are the physical offsets within the lowcore (prefix area) where the hardware loads the PSW on each interrupt class (PoP SA22-7832 §4.3.3).
| Constant | Offset | Interrupt class |
|---|---|---|
| PSW_LC_RESTART | 0x01A0 | Restart |
| PSW_LC_EXTERNAL | 0x01B0 | External |
| PSW_LC_SVC | 0x01C0 | Supervisor call |
| PSW_LC_PROGRAM | 0x01D0 | Program check |
| PSW_LC_MCCK | 0x01E0 | Machine check |
| PSW_LC_IO | 0x01F0 | I/O |
Note: These offsets are distinct from the old PSW save slots (0x0120–0x0170) and from the interrupt parameter area (0x0080–0x00C0).
4. Boot Initialization
The ZXFL loader prepares the memory tables, registers the Home Space ASCE in CR13 and the Primary Space ASCE in CR1, and directly transitions to DAT-on mode using a PSW_MASK_KERNEL_DAT PSW target before passing control to the kernel.
Thus, the kernel boots with DAT active and executes completely in Home-Space. The legacy psw_install_new_psws() and zx_lowcore_setup_pre_dat() methods have been removed because the pre-DAT boot window is bypassed by the loader.
During early kernel initialization, zx_lowcore_setup_late() is called to install the live interrupt handler entry points directly into the HHDM-mapped lowcore.
Synchronization Primitives
Document Revision: 26h1.0
Source: zxfoundation/sync/, include/zxfoundation/spinlock.h, include/zxfoundation/atomic.h
1. Atomic Operations
include/zxfoundation/atomic.h provides atomic_t (32-bit) and atomic64_t (64-bit) types with the standard load/store/add/sub/cmpxchg operations, implemented using z/Architecture's CS (Compare and Swap) and CSG (Compare and Swap, 64-bit) instructions.
2. Spinlock
include/zxfoundation/spinlock.h provides a ticket spinlock. Ticket spinlocks guarantee FIFO ordering, preventing starvation on highly contended locks.
| Function | Description |
|---|---|
| spin_lock(lock) | Acquire; busy-wait with DIAG 44 (yield hint) |
| spin_unlock(lock) | Release |
| spin_lock_irqsave(lock, flags) | Acquire + disable interrupts, save PSW mask |
| spin_unlock_irqrestore(lock, flags) | Release + restore PSW mask |
irqsave/irqrestore variants are required whenever a lock may be acquired from both process context and interrupt context.
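The FIFO property of a ticket lock follows from its two counters. The sketch below uses portable C11 atomics rather than the CS instruction and omits the DIAG 44 yield hint; the type and function names are illustrative and are not the ones in spinlock.h.

```c
#include <stdatomic.h>

/* Ticket lock: 'next' hands out tickets, 'owner' is the ticket being served. */
typedef struct {
    atomic_uint next;
    atomic_uint owner;
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    /* Take a ticket; arrival order fixes service order (FIFO). */
    unsigned me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me) {
        /* The real lock would issue DIAG 44 here as a yield hint. */
    }
}

static void ticket_unlock(ticket_lock_t *l)
{
    /* Only the holder writes 'owner', so load-then-store is sufficient. */
    unsigned cur = atomic_load_explicit(&l->owner, memory_order_relaxed);
    atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}
```

The acquire load in ticket_lock pairs with the release store in ticket_unlock, so writes made inside one critical section are visible to the next holder.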
3. Mutex
zxfoundation/sync/mutex.c — a sleeping mutex backed by a wait queue. Suitable for contexts where sleeping is permitted (not interrupt handlers).
| Function | Description |
|---|---|
| mutex_lock(m) | Acquire; sleep if contended |
| mutex_trylock(m) | Non-blocking acquire; returns 0 on failure |
| mutex_unlock(m) | Release; wake one waiter |
4. Reader-Writer Lock
zxfoundation/sync/rwlock.c — allows multiple concurrent readers or one exclusive writer.
| Function | Description |
|---|---|
| rwlock_read_lock(rw) | Acquire shared read access |
| rwlock_read_unlock(rw) | Release read access |
| rwlock_write_lock(rw) | Acquire exclusive write access |
| rwlock_write_unlock(rw) | Release write access |
5. Semaphore
zxfoundation/sync/semaphore.c — counting semaphore.
| Function | Description |
|---|---|
| sem_init(s, count) | Initialize with initial count |
| sem_wait(s) | Decrement; sleep if count is 0 |
| sem_post(s) | Increment; wake one waiter |
6. Wait Queue
zxfoundation/sync/waitqueue.c — a list of sleeping tasks waiting for a condition.
| Function | Description |
|---|---|
| waitqueue_init(wq) | Initialize |
| waitqueue_wait(wq, condition) | Sleep until condition is true |
| waitqueue_wake_one(wq) | Wake the first waiter |
| waitqueue_wake_all(wq) | Wake all waiters |
7. RCU
zxfoundation/sync/rcu.c — Read-Copy-Update. Currently a stub; rcu_read_lock/rcu_read_unlock are no-ops and synchronize_rcu returns immediately.
RCU and SRCU
Document Revision: 26h1.1
Source: zxfoundation/sync/rcu.c, zxfoundation/sync/srcu.c
1. RCU
Read-Copy-Update for a non-preemptive kernel. A quiescent state (QS) occurs whenever a CPU is not inside an rcu_read_lock() section.
Read Side
| Function | Description |
|---|---|
| rcu_read_lock() | Enter read-side critical section (compiler barrier only) |
| rcu_read_unlock() | Exit read-side critical section |
| rcu_dereference(p) | Safely read an RCU-protected pointer |
| rcu_assign_pointer(p, v) | Safely publish a new pointer |
Write Side
| Function | Description |
|---|---|
| call_rcu(head, fn) | Register a callback for after the next grace period |
| synchronize_rcu() | Block until all pre-existing readers have completed, then drain callbacks |
| rcu_report_qs() | Report a quiescent state for the current CPU |
Grace Period Mechanism
synchronize_rcu():
1. Increment gp_seq
2. Broadcast new gp_seq to all per-CPU rcu_gp_seq fields
3. Spin until every CPU's rcu_qs_seq == gp_seq
4. Drain callback list
rcu_report_qs() must be called from the idle loop and any long-running non-read-side context.
2. SRCU
Sleepable RCU — allows read-side critical sections to sleep. Each SRCU domain (srcu_struct_t) is independent.
Read Side
| Function | Description |
|---|---|
| srcu_read_lock(s) | Enter SRCU read section; returns slot index |
| srcu_read_unlock(s, idx) | Exit SRCU read section |
Write Side
| Function | Description |
|---|---|
| synchronize_srcu(s) | Wait for all pre-existing readers; may spin |
| call_srcu(s, head, fn) | Synchronize then invoke callback |
Two-Slot Mechanism
Active slot: s->idx (0 or 1)
srcu_read_lock: increment pcpu[cpu].c[s->idx]
srcu_read_unlock: decrement pcpu[cpu].c[idx]
synchronize_srcu:
1. Flip s->idx (new readers use new slot)
2. Wait until sum of pcpu[*].c[old_idx] == 0
3. Increment gp_seq
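The two-slot bookkeeping can be modelled single-threaded to show why the writer waits on the sum of the old slot rather than on each CPU's counter. This is a sketch only: it uses plain integers instead of atomics, a fixed hypothetical CPU count, and names that are not taken from srcu.c.

```c
#include <stdbool.h>

#define NCPU 4   /* hypothetical CPU count */

typedef struct {
    int  idx;            /* active slot: 0 or 1 */
    long c[NCPU][2];     /* per-CPU reader counts, one per slot */
} srcu_sketch_t;

/* Enter a read section on 'cpu'; returns the slot to pass to unlock. */
static int srcu_lock_sketch(srcu_sketch_t *s, int cpu)
{
    int idx = s->idx;
    s->c[cpu][idx]++;
    return idx;
}

/* Leave a read section; 'cpu' may differ from the CPU that locked. */
static void srcu_unlock_sketch(srcu_sketch_t *s, int cpu, int idx)
{
    s->c[cpu][idx]--;
}

/* Flip the active slot and report which slot must drain. */
static int srcu_flip(srcu_sketch_t *s)
{
    int old = s->idx;
    s->idx = old ^ 1;
    return old;
}

/* True once no reader remains in the given slot across all CPUs. */
static bool srcu_slot_drained(const srcu_sketch_t *s, int slot)
{
    long sum = 0;
    for (int cpu = 0; cpu < NCPU; cpu++)
        sum += s->c[cpu][slot];
    return sum == 0;
}
```

Because a sleeping reader may resume on another CPU, an individual per-CPU counter can go negative; only the sum over all CPUs is meaningful.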
Initialization
DEFINE_SRCU(my_domain); // static
srcu_init(&my_domain); // runtime
Kernel Object Management System
Document: ZXF-KRN-KOMS-001
Revision: 1.0
Status: Released
1. Purpose
The Kernel Object Management System (KOMS) is the unified abstraction layer
for all reference-counted kernel objects. It defines a single base type,
kobject_t, that any subsystem may embed to obtain lifecycle management,
naming, attribute storage, event delivery, and hierarchical organization at
no additional per-subsystem cost.
2. Architectural Position
KOMS sits immediately above the memory allocator and synchronization primitives, and below all subsystems that manage named, reference-counted resources.
┌─────────────────────────────────────────────────────┐
│ Subsystems (IRQ, VMM, Device, Task, File, …) │
├─────────────────────────────────────────────────────┤
│ KOMS (koms.h / koms.c) │
├──────────────┬──────────────┬───────────────────────┤
│ kmalloc / │ spinlock / │ RCU │
│ slab │ rwlock │ │
└──────────────┴──────────────┴───────────────────────┘
KOMS is initialized once, after kmalloc_init(), before any subsystem that
registers a type or allocates a managed object.
3. Core Concepts
3.1 kobject_t
Every managed object embeds kobject_t as its first member. The base
object carries:
- An atomic reference counter (kref_t).
- A mandatory operations table (kobject_ops_t) with a release callback.
- A lifecycle state (KOBJECT_UNINITIALIZED, KOBJECT_ALIVE, KOBJECT_DEAD).
- A static name string.
- A 32-bit type identifier.
- A 32-bit flags word.
- Intrusive list nodes for parent/child hierarchy, namespace membership, attributes, and event listeners.
- An embedded spinlock_t protecting the mutable extension fields.
- An rcu_head_t for deferred free.
The kobject_container() macro recovers the containing struct from a
kobject_t * pointer using compile-time offset arithmetic.
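The offset arithmetic behind a container-recovery macro looks roughly like this. The three-argument form shown here is an assumption (the guide does not give the macro's signature), and kobject_t is reduced to a one-field stand-in; my_device_t is a hypothetical subsystem type.

```c
#include <stddef.h>

/* Minimal stand-in for the real kobject_t. */
typedef struct kobject { int refcount; } kobject_t;

/* Recover the enclosing struct from a pointer to its embedded member. */
#define kobject_container(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical subsystem object embedding kobject_t as its first member. */
typedef struct my_device {
    kobject_t kobj;
    int       unit;
} my_device_t;

/* Typical use inside a callback that only receives the base pointer. */
static int device_unit_from_kobj(kobject_t *k)
{
    my_device_t *dev = kobject_container(k, my_device_t, kobj);
    return dev->unit;
}
```

With kobject_t at offset 0 the subtraction is a no-op, which is why the first-member rule also permits a plain cast; the macro form stays correct if the member is ever moved.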
3.2 Type Registry
A kobj_type_t descriptor is registered once at boot per object class.
It carries:
| Field | Purpose |
|---|---|
| type_id | Globally unique 32-bit identifier |
| name | Human-readable string for diagnostics |
| obj_size | sizeof of the containing struct |
| cache | Optional dedicated slab cache |
| kobj_ops | Mandatory ops table (must provide release) |
| type_ops | Optional extended vtable (init, destroy, ns_add, ns_remove) |
After koms_init() the registry is append-only and read locklessly.
3.3 Namespace
A kobj_ns_t is an RCU-protected hash table of kobject_t pointers,
keyed by name. Namespaces form a tree rooted at koms_root_ns.
koms_root_ns
├── "irq"
│ ├── "ext-0x40"
│ └── "pgm-0x0d"
├── "vmm"
│ └── "kernel"
└── "device"
└── "dasd-0"
Reads use rcu_read_lock() and are fully lockless. Writes acquire the
namespace's write_lock (spinlock, irqsave).
3.4 Attributes
Attributes are kobj_attr_t nodes linked into kobject_t::attrs. Each
attribute has a name and optional get/set callbacks. The attribute list
is protected by kobject_t::lock.
3.5 Event Bus
Events are typed (kobj_event_type_t) and carry a payload union.
Listeners (kobj_listener_t) are registered per-object with an optional
event-type bitmask filter. Dispatch snapshots the listener list under the
object lock, then calls each listener without the lock, preventing deadlocks
on re-entrant dispatch. Events propagate up the parent chain automatically.
4. Lifecycle
koms_alloc()
│
▼
[refcount = 0]
│
koms_init_obj()
│
▼
KOBJECT_ALIVE ◄──── koms_get()
[refcount = 1]
│
koms_put() × N
│
[refcount = 0]
│
▼
KOBJECT_DEAD
│
ops->release()
│
▼
koms_free()
koms_freeze() sets KOBJ_FLAG_FROZEN, causing koms_get_unless_dead() to
fail without affecting existing references. This enables controlled
teardown: freeze the object, wait for all external references to drain, then
drop the final reference.
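A conditional reference grab of this kind is usually a compare-and-swap loop that refuses to resurrect a zero count. The sketch below uses C11 atomics; the struct, the flag's bit value, and the function names are assumptions, and the real koms_get_unless_dead() may order its checks differently.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define KOBJ_FLAG_FROZEN 0x1u   /* bit value is an assumption */

typedef struct {
    atomic_uint refcount;
    atomic_uint flags;
} kobj_sketch_t;

/* Take a reference only if the object is neither frozen nor already at zero.
 * A freeze racing with the flag check may still admit one late reference. */
static bool get_unless_dead(kobj_sketch_t *o)
{
    if (atomic_load(&o->flags) & KOBJ_FLAG_FROZEN)
        return false;
    unsigned old = atomic_load(&o->refcount);
    while (old != 0) {
        /* On failure 'old' is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true when the caller released the last one. */
static bool put_ref(kobj_sketch_t *o)
{
    return atomic_fetch_sub(&o->refcount, 1) == 1;
}

/* Block new references without disturbing existing ones. */
static void freeze(kobj_sketch_t *o)
{
    atomic_fetch_or(&o->flags, KOBJ_FLAG_FROZEN);
}
```

The caller for which put_ref returns true is the one that would mark the object KOBJECT_DEAD and invoke the release callback.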
5. Allocation Strategy
koms_alloc(type, gfp)
│
├─ type->cache != nullptr ──► kmem_cache_alloc(type->cache, gfp | ZERO)
│
└─ type->cache == nullptr ──► kzalloc(type->obj_size, gfp)
koms_free() dispatches symmetrically. The KOBJ_FLAG_KOMS_ALLOC flag
distinguishes heap-allocated objects from statically embedded ones.
6. Thread Safety Summary
| Operation | Mechanism |
|---|---|
| Reference count | Lock-free (CS instruction) |
| Attribute list | kobject_t::lock (spinlock, irqsave) |
| Listener list | kobject_t::lock (spinlock, irqsave) |
| Child list | kobject_t::lock (spinlock, irqsave) |
| Namespace reads | rcu_read_lock() (lockless) |
| Namespace writes | kobj_ns_t::write_lock (spinlock, irqsave) |
| Type registry reads | Lockless (append-only after boot) |
| Type registry writes | type_registry_lock (spinlock, irqsave) |
7. Integration Guide
To integrate a subsystem with KOMS:
- Embed kobject_t as the first member of the subsystem struct.
- Define a kobject_ops_t with a release callback that calls koms_free().
- Optionally define a kobj_type_ops_t for init/destroy hooks.
- Define and register a kobj_type_t from the subsystem's init function.
- Allocate objects with koms_alloc() and initialize with koms_init_obj().
- Use koms_get()/koms_put() for reference management.
- Optionally register in a namespace with koms_ns_add().
8. Initialization Order
KOMS must be initialized after kmalloc_init() and before any subsystem
that calls koms_type_register() or koms_alloc().
pmm_init → cma_init → mmu_init → vmm_init → slab_init → kmalloc_init
→ koms_init → smp_init → [subsystem inits]
Red-Black Tree
Document Revision: 26h1.1
Source: lib/rbtree.c, include/lib/rbtree.h
1. Overview
ZXFoundation™ provides a layered intrusive red-black tree library. Each layer is a strict superset of the one below it; callers of lower layers require no modification when higher layers are added.
| Layer | Type | Concurrency |
|---|---|---|
| 0 — Core | rb_root_t | None (caller-managed) |
| 1 — Augmented | rb_root_aug_t | None (caller-managed) |
| 2 — RCU-protected | rcu_rb_root_t | Lockless readers, serialised writers |
| 2A — RCU-augmented | rcu_rb_root_aug_t | Lockless readers, serialised writers + propagation |
| 3 — Per-CPU cached | rb_pcpu_cache_t | O(1) fast path per CPU |
The tree is intrusive: the caller embeds rb_node_t (or rb_node_aug_t) inside its own struct and recovers the container with rb_entry(). The colour bit is packed into bit 0 of the parent pointer, keeping rb_node_t at exactly 24 bytes.
2. Node Layout
rb_node_t (24 bytes)
┌──────────────────────────┐
│ left (8 B) │ pointer to left child
│ right (8 B) │ pointer to right child
│ parent_and_color (8 B) │ parent ptr | colour bit (bit 0)
└──────────────────────────┘
rb_node_aug_t (32 bytes)
┌──────────────────────────┐
│ node (rb_node_t, 24 B) │ must be at offset 0 — cast-compatible
│ subtree_max_gap (8 B) │ maintained by propagate callback
└──────────────────────────┘
All rb_node_t pointers are 8-byte aligned on s390x, so the least-significant bit of any valid pointer (bit 0 in this chapter, counting from the low-order end) is always zero and is free for colour storage.
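Packing the colour into the parent pointer can be sketched with three small accessors. The field layout follows the diagram above; the accessor names and the choice of 1 for black are assumptions, not necessarily what rbtree.h uses.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct rb_node {
    struct rb_node *left;
    struct rb_node *right;
    uintptr_t       parent_and_color;  /* parent pointer | colour in low bit */
} rb_node_t;

#define RB_RED   0u
#define RB_BLACK 1u   /* which value means black is an assumption */

/* Mask off the colour bit to recover the parent pointer. */
static inline rb_node_t *rb_parent(const rb_node_t *n)
{
    return (rb_node_t *)(n->parent_and_color & ~(uintptr_t)1);
}

static inline bool rb_is_black(const rb_node_t *n)
{
    return (n->parent_and_color & 1u) == RB_BLACK;
}

/* Store parent and colour together in a single word. */
static inline void rb_set_parent_color(rb_node_t *n, rb_node_t *parent,
                                       unsigned color)
{
    n->parent_and_color = (uintptr_t)parent | (color & 1u);
}
```

Since parent and colour share one word, a recolouring is a single store that leaves the pointer bits untouched.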
3. Layer 0 — Core
The core layer provides O(log n) insert, erase, and traversal with no locking. All operations are iterative (bounded stack depth).
Insert Protocol
walk tree → find (parent, link)
rb_link_node(node, parent, link)
rb_insert_fixup(tree, node)
Erase
rb_erase(tree, node)
Traversal
rb_first(tree) → minimum node
rb_last(tree) → maximum node
rb_next(node) → in-order successor
rb_prev(node) → in-order predecessor
rb_for_each(pos, tree)
rb_for_each_entry(pos, tree, member)
Container Recovery
rb_entry(ptr, type, member)
rb_entry_safe(ptr, type, member) ← null-safe variant
4. Layer 1 — Augmented
The augmented layer adds a rb_aug_callbacks_t to rb_root_aug_t. After every structural change (insert, erase, rotation), propagate is invoked bottom-up from the affected node to the root.
Callers embed rb_node_aug_t instead of rb_node_t and maintain a per-node subtree aggregate in subtree_max_gap.
Callbacks
propagate(node) recompute node->subtree_max_gap from children
copy(dst, src) copy aggregate when successor replaces deleted node
copy is required when the two-child erase case physically moves the successor into the deleted node's position. Without it the successor would carry a stale aggregate into its new location.
Propagation Order
structural change at node L
│
▼
propagate(L) ← children already up-to-date
│
▼
propagate(parent(L))
│
▼
… (up to root)
API
rb_root_aug_t root = RB_ROOT_AUG_INIT(&my_callbacks);
rb_insert_aug(&root, node, parent, link);
rb_erase_aug(&root, node);
5. Layer 2 — RCU-Protected
rcu_rb_root_t wraps rb_root_t with a write-side spinlock. Readers use the RCU lockless path; writers serialise through the lock and publish pointer updates via rcu_assign_pointer().
Concurrency Model
Reader Writer
────────────────────── ──────────────────────────────
rcu_read_lock() spin_lock_irqsave(&root->lock)
node = rcu_rb_find(...) rb_erase(...)
// use node safely rcu_assign_pointer(root, ...)
rcu_read_unlock() spin_unlock_irqrestore(...)
call_rcu(head, free_fn)
rcu_assign_pointer() issues smp_mb() before the store. rcu_dereference() issues a compiler barrier after each pointer load, preventing the compiler from collapsing multiple loads of the same pointer.
Erase and Grace Period
rcu_rb_erase(root, node, head, free_fn)
├─ unlink node under lock
├─ rcu_assign_pointer(...) ← publish updated tree
└─ call_rcu(head, free_fn) ← free after grace period
6. Layer 2A — RCU-Augmented
rcu_rb_root_aug_t composes Layer 1 and Layer 2 under a single write lock. The lock covers both rebalancing and aggregate propagation atomically.
Key invariant: readers always observe a tree where subtree_max_gap is consistent with the pointer structure they see, because both are updated under the same lock before rcu_assign_pointer() publishes the result.
Gap Search
rcu_rb_aug_find_gap() performs an O(log n) free-gap search by pruning subtrees whose subtree_max_gap is smaller than the requested size:
find_gap(root, size, align, lo, hi):
    cursor = lo
    n = root
    while n:
        if n.left.subtree_max_gap >= size:
            descend left ← prune right subtree entirely
            continue
        aligned = align_up(cursor, align)
        if aligned + size <= n.start:
            return aligned ← gap found left of n
        cursor = max(cursor, n.end)
        n = n.right ← no gap left of n; try right
    aligned = align_up(cursor, align)
    if aligned + size <= hi:
        return aligned ← gap after last node
    return 0 ← no gap found
This replaces the former O(n) linear scan. The caller supplies node_start and node_end accessors, making the search generic over any interval type.
API
rcu_rb_root_aug_t root = RCU_RB_ROOT_AUG_INIT(&my_callbacks);
rcu_rb_aug_insert(&root, node, parent, link);
rcu_rb_aug_erase(&root, node, head, free_fn);
// Under lock or rcu_read_lock():
uint64_t addr = rcu_rb_aug_find_gap(&root, size, align, lo, hi,
node_start_fn, node_end_fn);
7. Layer 3 — Per-CPU Cached
rb_pcpu_cache_t is a per-CPU array of (hint, hint_key) pairs. On a cache hit the search returns in O(1) without touching the tree.
rb_find_cached(root, cache, cmp, arg):
    cpu = current_cpu()
    hint = cache[cpu].hint
    if hint != NULL && cmp(hint, arg) == 0:
        return hint ← O(1) fast path
    // full O(log n) walk
    result = tree_walk(root, cmp, arg)
    cache[cpu].hint = result
    return result
The hint is opportunistic — it may be stale. The comparator validates it before the result is returned.
Invalidation
rb_cache_invalidate(cache, node) O(MAX_CPUS) — call before erase
rb_cache_invalidate_local(cache) O(1) — current CPU only
rb_cache_invalidate() must be called before rb_erase() or rcu_rb_aug_erase() on any node in a cached tree to prevent dangling hint pointers.
8. RB-Tree Invariants
The implementation maintains the four standard invariants after every operation:
- Every node is RED or BLACK.
- The root is BLACK.
- Every RED node has two BLACK children.
- Every path from a node to a null leaf contains the same number of BLACK nodes.
Insert fixup resolves double-red violations with at most 2 rotations and O(log n) recolourings. Erase fixup resolves double-black violations with at most 3 rotations and O(log n) recolourings. Recolourings do not change pointer structure and are invisible to RCU readers.
9. Constraints
- rb_node_aug_t::node must be at offset 0. The _Static_assert in the header enforces this.
- rb_aug_callbacks_t::copy may be nullptr only if the caller guarantees no two-child erase will occur. For general use it must be provided.
- rb_cache_invalidate() must be called before erasing a node from any cached tree.
- rcu_rb_aug_find_gap() may be called under rcu_read_lock() for a best-effort result, or under the write lock for a guaranteed-current result.
- synchronize_rcu() may block indefinitely if a CPU never reports a quiescent state. Callers of rcu_rb_aug_erase() must ensure rcu_report_qs() is called from the idle loop and scheduler tick.
Time Subsystem
Document: ZXF-KRN-TIME-001
Revision: 26h1.0
Status: Draft
1. Overview
The time subsystem provides three services to the rest of the kernel:
- Monotonic kernel time (ktime_t): nanoseconds since boot, readable from any context.
- Scheduler preemption: the CPU timer fires EXT 0x1005 every 10 ms to enforce quanta.
- Deferred execution: the clock comparator fires EXT 0x1004 to advance the per-CPU timer wheel.
All hardware access (STCKF, SPTC, STPTC, SCKC, STCKC, CR0 manipulation) is confined to arch/s390x/time/tod.c. The portable kernel layer in zxfoundation/time/ calls only the functions declared in include/arch/s390x/time/tod.h.
2. Hardware Sources
z/Architecture provides three time mechanisms, one global and two per-CPU:
| Source | Instruction | Type | Resolution | Kernel use |
|---|---|---|---|---|
| TOD clock | STCKF | Global, monotonic | ~0.244 ns | ktime_get(), sleep deadline |
| CPU timer | SPTC / STPTC | Per-CPU countdown | Same as TOD | Scheduler quantum (10 ms) |
| Clock comparator | SCKC / STCKC | Per-CPU absolute | Same as TOD | Timer wheel advance |
The TOD clock is shared across all CPUs and is monotonic. STCKF reads it without serialization and is safe from hard-IRQ context.
3. TOD Unit Conversion
1 TOD unit = 1000/4096 ns = 125/512 ns
ktime_ns = tod_delta × 125 / 512
tod_units = ns × 512 / 125
Constants used throughout the subsystem:
TOD_1MS = 4 096 000 units
TOD_10MS = 40 960 000 units
TOD_1S = 4 096 000 000 units
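The 125/512 ratio can be applied in 64-bit integer arithmetic without a wide multiply by scaling the quotient and remainder separately. The helper names below are illustrative; the guide mentions ns_to_tod but does not show its implementation.

```c
#include <stdint.h>

#define TOD_1MS 4096000ULL
#define TOD_1S  4096000000ULL

/* TOD units -> nanoseconds (multiply by 125/512, rounding down).
 * Splitting into quotient and remainder keeps every intermediate
 * product below 2^64 for any 64-bit input. */
static inline uint64_t tod_to_ns(uint64_t tod)
{
    return (tod / 512) * 125 + ((tod % 512) * 125) / 512;
}

/* Nanoseconds -> TOD units (multiply by 512/125, rounding down).
 * The result itself exceeds 64 bits for spans beyond roughly 142 years. */
static inline uint64_t ns_to_tod(uint64_t ns)
{
    return (ns / 125) * 512 + ((ns % 125) * 512) / 125;
}
```

Both directions round down, so a round trip can lose up to one nanosecond; deadline code that must never fire early would round the TOD value up instead.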
4. Initialization Sequence
BSP:
  time_init()
    tod_set_boot_offset(STCKF) ← recorded once; never modified
    timer_wheel_init() ← per-CPU wheel, level/slot arrays zeroed
    tod_enable_ext_interrupts() ← CR0 bits 52+53 set
    tod_cpu_timer_set(-10ms) ← first quantum armed
    tod_clock_comparator_set(now + 1s) ← safe initial value
Each AP (from ap_startup):
  time_init_ap()
    timer_wheel_init()
    tod_enable_ext_interrupts()
    tod_cpu_timer_set(-10ms)
    tod_clock_comparator_set(now + 1s)
tod_boot_offset is set on the BSP before any AP is started. APs call ktime_get() using the same offset — this is correct because the TOD clock is global.
5. Interrupt Dispatch
The EXT interrupt handler (do_ext_interrupt) intercepts the two time-critical subclasses before the generic irq_dispatch() path:
do_ext_interrupt:
  ext_code = lowcore.ext_int_code
  if ext_code == 0x1005 → time_cpu_timer_handler() // CPU timer
  else if ext_code == 0x1004 → time_clock_comparator_handler() // clock comparator
  else → irq_dispatch(ZX_IRQ_BASE_EXT + ext_code, frame)
This avoids routing through the irqdesc table, whose 0x0400-entry limit cannot accommodate the full 16-bit EXT subclass space.
6. Timer Wheel
6.1 Structure
8 levels × 64 slots per CPU. Level 0 has 1 ms slot width; each subsequent level is 64× wider.
Level 0: slot = 1 ms, range = 64 ms
Level 1: slot = 64 ms, range = ~4 s
Level 2: slot = ~4 s, range = ~4 min
Level 3: slot = ~4 min, range = ~4.5 h
...
Level 7: slot = ~139 y, range = ~8 900 y
6.2 Placement
A timer with expiry delta d from now is placed in the lowest level l such that d < range(l), at slot (current_slot[l] + d/slot_width[l] + 1) % 64.
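The placement rule can be expressed in a few lines once deltas are measured in milliseconds. This is a sketch under that assumption (the wheel may well work in TOD units internally), and the function names are hypothetical.

```c
#include <stdint.h>

#define WHEEL_LEVELS 8
#define WHEEL_SLOTS  64

/* Slot width of a level in milliseconds: 64^level (level 0 = 1 ms). */
static uint64_t wheel_slot_width_ms(unsigned level)
{
    return 1ULL << (6 * level);
}

/* Lowest level whose range (64 slots) still covers a delta of d_ms.
 * Deltas beyond the top level's range are clamped to the top level. */
static unsigned wheel_level_for(uint64_t d_ms)
{
    unsigned level = 0;
    while (level < WHEEL_LEVELS - 1 &&
           d_ms >= wheel_slot_width_ms(level) * WHEEL_SLOTS)
        level++;
    return level;
}

/* Slot index per the placement rule: (current + d / width + 1) % 64. */
static unsigned wheel_slot_for(uint64_t d_ms, unsigned level,
                               unsigned current_slot)
{
    return (unsigned)((current_slot + d_ms / wheel_slot_width_ms(level) + 1)
                      % WHEEL_SLOTS);
}
```

For example, a 5000 ms timer exceeds level 1's 4096 ms range, so it lands on level 2; the + 1 in the slot formula rounds up so a timer cannot fire before its expiry.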
6.3 Advance
On the clock-comparator interrupt (EXT 0x1004), timer_wheel_advance(now) steps level 0 slot by slot, firing all expired timers. When level 0 completes a full revolution, it cascades timers from level 1 into lower levels, and so on.
6.4 Constraints
- All wheel operations require IRQs disabled on the calling CPU.
- Callbacks execute in hard-IRQ context. They must not block or acquire locks held by process context.
7. ktime_sleep()
Current implementation is a busy-wait:
deadline = STCKF + ns_to_tod(ns)
SCKC(deadline)
while STCKF < deadline: cpu_relax()
This is correct for early boot and short delays. Once the scheduler is operational, this will be replaced with a block/wake implementation using the timer wheel.
8. Strict Requirements
| # | Requirement |
|---|---|
| TIME-1 | ktime_get() is callable from any context. No lock, no sleep. |
| TIME-2 | Timer callbacks execute in hard-IRQ context. No blocking, no process-context locks. |
| TIME-3 | CPU timer must be reloaded on every time_cpu_timer_handler() invocation. |
| TIME-4 | Clock comparator must be reprogrammed after every timer_wheel_advance() call. |
| TIME-5 | tod_boot_offset is set once in time_init() and never modified. |
| TIME-6 | time_init_ap() must be called on every AP before the AP enters its idle loop. |
Scheduler
Subsystem Stubs
Document Revision: 26h1.1
The following subsystems have source directories and header files but are not yet implemented.
IRQ (arch/s390x/irq/)
Handles I/O interrupts from the channel subsystem. The I/O new PSW at lowcore 0x1F0 (PSW_LC_IO) must point to the I/O interrupt handler. The handler calls TSCH to read the IRB and dispatches to the appropriate device driver.
Status: Stub — new PSW installed as disabled-wait.
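As an illustration of what installing a disabled-wait new PSW involves, the sketch below writes one into a byte buffer that stands in for the lowcore. The offset and mask bits follow the z/Architecture PSW format; the function names and the use of the address field as a marker code are assumptions, not the kernel's actual code.

```c
#include <stdint.h>
#include <stddef.h>

#define LC_IO_NEW_PSW 0x1F0u                          /* I/O new PSW offset in the lowcore */
#define PSW_MASK_WAIT UINT64_C(0x0002000000000000)    /* wait state */
#define PSW_MASK_EA   UINT64_C(0x0000000100000000)    /* extended addressing */
#define PSW_MASK_BA   UINT64_C(0x0000000080000000)    /* basic addressing; EA + BA = 64-bit mode */

/* Store a 64-bit value big-endian, which is the native s390x byte order. */
static void put_be64(uint8_t *dst, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        dst[i] = (uint8_t)(v & 0xFFu);
        v >>= 8;
    }
}

/*
 * Write a disabled-wait PSW into the I/O new PSW slot of a lowcore image.
 * All interrupt mask bits are zero, so taking an I/O interrupt parks the CPU.
 * code is stored in the PSW address field as a marker (hypothetical convention).
 */
static void install_io_disabled_wait(uint8_t *lowcore, uint64_t code)
{
    put_be64(lowcore + LC_IO_NEW_PSW,
             PSW_MASK_WAIT | PSW_MASK_EA | PSW_MASK_BA);
    put_be64(lowcore + LC_IO_NEW_PSW + 8, code);
}
```

On real hardware the destination would be the lowcore itself rather than a buffer; the buffer makes the byte layout easy to inspect on a host.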
Time (arch/s390x/time/)
Provides kernel timekeeping using the TOD (Time-of-Day) clock. The TOD clock is a 64-bit counter in which bit 51 increments once per microsecond, so one microsecond corresponds to 4096 TOD units. The boot timestamp is available in proto->tod_boot. The clock comparator interrupt (an external interrupt subclass) drives the scheduler tick once the IRQ subsystem is active.
Status: Stub.
Build System Overview
Document Revision: 26h1.0
1. Prerequisites
| Tool | Minimum version | Notes | Required |
|---|---|---|---|
| CMake | 3.10 | Build system generator | required |
| Compiler and tools | toolchain-specific | See toolchains.md | partly |
| Ninja | any | Recommended generator | optional |
| dasdload | any | Needed for image generation | optional |
| Hercules | 4.x | Helpful for development | optional |
2. Output Artifacts
| Artifact | Description | Converted from |
|---|---|---|
| core.zxfoundationloader00.sys | Stage 0 IPL record (tape format) | zxfl_stage1.elf → zxfl_stage1.bin |
| core.zxfoundationloader01.sys | Stage 1 flat binary | zxfl_stage2.elf |
| core.zxfoundation.nucleus | Kernel ELF64 (SHA-256 checksums patched in) | N/A |
| sysres.3390 | Hercules 3390 DASD image | N/A |
| bin2rec | Host tool | N/A |
| zxsign | Host tool | N/A |
3. CMake Modules
| Module | Purpose |
|---|---|
| cmake/dependencies.cmake | Host dependency checks |
| cmake/configuration.cmake | OPT_LEVEL, DSYM_LEVEL cache variables |
| cmake/platform.cmake | Platform detection |
| cmake/standard.cmake | C standard enforcement |
| cmake/hosttools.cmake | Build bin2rec and zxsign with host compiler |
| cmake/source.cmake | Kernel source file lists (ZX_SOURCES_64) |
| cmake/zxfl-compile.cmake | ZXFL Stage 0 and Stage 1 targets |
| cmake/zxfoundation-compile.cmake | Kernel nucleus target |
| cmake/run.cmake | dasd target: generates sysres.3390 |
4. Build Order
CMake enforces the following dependency chain:
tools (bin2rec, zxsign — host compiler)
│
├─► zxfl_stage1.elf
│     └─► zxfl_stage1.bin (objcopy)
│           └─► core.zxfoundationloader00.sys (bin2rec)
│
├─► zxfl_stage2.elf
│     └─► core.zxfoundationloader01.sys (objcopy)
│
└─► core.zxfoundation.nucleus
      └─► zxsign patches .zxvl_checksums in-place
            └─► sysres.3390 (dasdload)
Host tools are always compiled first with ZX_HOST_CC. The kernel and loader are compiled with the cross-compiler.
5. Configuration Variables (toolchain-independent; for toolchain-specific variables, see toolchains.md)
| Variable | Default | Description |
|---|---|---|
| OPT_LEVEL | 2 | -O level for all targets |
| DSYM_LEVEL | 0 | -g level (0 = no debug info) |
Override at configure time:
cmake -B build \
-DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-clang.cmake \
-DOPT_LEVEL=3
Toolchains
Document Revision: 26h1.0
1. Clang (cmake/toolchain/zxfoundation-clang.cmake)
Uses LLVM's built-in cross-compilation support — no separate cross-compiler installation is required on most systems.
cmake -B build \
-DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-clang.cmake \
-DMARCH_MODE=z14
| Role | Tool |
|---|---|
| C compiler | clang (or clang-$CLANG_VERSION) |
| Linker | ld.lld |
| Archiver | llvm-ar |
| objcopy | llvm-objcopy |
| Host CC | clang |
Set CLANG_VERSION in the environment to select a versioned binary (e.g. CLANG_VERSION=18 → clang-18). If unset, unversioned clang is used.
The target triple --target=s390x-unknown-none-elf is passed as a compile option (not via CMAKE_C_COMPILER_TARGET) to avoid CMake's compiler detection interfering with the freestanding build.
2. GCC (cmake/toolchain/zxfoundation-gcc.cmake)
Requires an s390x-ibm-linux-gnu-* cross-compiler toolchain installed on the host.
cmake -B build \
-DCMAKE_TOOLCHAIN_FILE=cmake/toolchain/zxfoundation-gcc.cmake
| Role | Tool |
|---|---|
| C compiler | s390x-ibm-linux-gnu-gcc |
| Linker | s390x-ibm-linux-gnu-ld |
| Archiver | s390x-ibm-linux-gnu-ar |
| objcopy | s390x-ibm-linux-gnu-objcopy |
| Host CC | gcc |
GCC-specific flags added to the kernel target:
| Flag | Reason |
|---|---|
| -static-libgcc | Avoid libgcc DSO dependency |
| -Wno-array-bounds | Suppress false positives from GCC's array-bounds analysis on lowcore pointer casts |
| -fno-delete-null-pointer-checks | The kernel legitimately dereferences physical address 0x0 (the lowcore) |
| -mzarch | Force z/Architecture mode |
3. Common Compiler Flags
Applied to all targets (loader and kernel):
| Flag | Reason |
|---|---|
| -ffreestanding | No hosted C library assumptions |
| -nostdlib | No implicit library linking |
| -fno-builtin | Do not treat standard library function names (memcpy, strlen, ...) as compiler builtins |
| -fno-strict-aliasing | Kernel code casts between unrelated pointer types |
| -fwrapv | Signed integer overflow wraps (defined behavior) |
| -ftrivial-auto-var-init=pattern | Auto-initialize locals to a poison pattern; catches use-before-init |
| -fno-stack-protector | No __stack_chk_guard (freestanding, no libc) |
| -msoft-float | No FPU use in kernel |
| -mno-vx | No vector instructions in kernel |
Kernel-only additional flag:
| Flag | Reason |
|---|---|
| -mpacked-stack | Use packed register save areas (reduces stack frame size) |
4. Custom Toolchain
To use a non-standard toolchain, copy one of the provided toolchain files and adjust the compiler/linker paths. The following CMake variables must be set:
| Variable | Description |
|---|---|
| CMAKE_C_COMPILER | Path to the C compiler |
| CMAKE_LINKER | Path to the linker |
| CMAKE_OBJCOPY | Path to objcopy |
| ZX_HOST_CC | Host C compiler for building bin2rec and zxsign |
| COMPILER_ID | "clang" or "gcc" (selects compiler-specific flag sets) |
| TARGET_EMULATION_MODE | elf64_s390 |
| MARCH_MODE | Target microarchitecture (e.g. z10, z14, z16) |
Build Targets
Document Revision: 26h1.0
tools
Builds host-native bin2rec and zxsign using ZX_HOST_CC. This target is an implicit dependency of all other targets — it always runs first.
zxfl_stage1.elf → core.zxfoundationloader00.sys
Compiles Stage 0. Post-build steps:
- objcopy -O binary zxfl_stage1.elf zxfl_stage1.bin: strip ELF headers to raw binary.
- bin2rec zxfl_stage1.bin core.zxfoundationloader00.sys: wrap in DASD IPL record format.
The linker script stage1.ld enforces a 12 KB size limit with ASSERT. The build fails if this limit is exceeded.
zxfl_stage2.elf → core.zxfoundationloader01.sys
Compiles Stage 1. Post-build step:
- objcopy -O binary zxfl_stage2.elf core.zxfoundationloader01.sys: flat binary at 0x20000.
core.zxfoundation.nucleus
Compiles the kernel. Post-build step:
- zxsign core.zxfoundation.nucleus: computes SHA-256 for each PT_LOAD segment and patches the digests into the .zxvl_checksums ELF section in-place.
The kernel linker script is arch/s390x/init/link.ld.
dasd → sysres.3390
Requires dasdload (from the Hercules package) on PATH.
- Remove any existing sysres.3390.
- Copy scripts/etc.zxfoundation.parm to the build directory.
- Run dasdload -z scripts/sysres.conf sysres.3390: create a 3390 (compressed) DASD image and write all datasets.
- Copy scripts/hercules.cnf to the build directory.
sysres.conf defines the dataset layout: Stage 0, Stage 1, nucleus, and parmfile.
Running
cmake --build build # builds everything, including the DASD image
hercules -f build/hercules.cnf
In the Hercules console:
ipl 0100
bin2rec
Document Revision: 26h1.0
Source: tools/bin2rec.c
1. Purpose
bin2rec converts a flat binary into the DASD IPL record format required by the Hercules dasdload utility and the z/Architecture channel subsystem.
bin2rec <input.bin> <output.sys>
2. Background
The z/Architecture IPL mechanism reads the first physical record from the IPL device and loads it into memory at address 0x0. The record must be in a specific format: each 80-byte card image contains a header identifying it as a text record (TXT) or end record (END), a load address, a byte count, and 56 bytes of data.
This format originates from the IBM card-punch era — the DASD IPL record format is a direct descendant of the punched-card object deck format.
3. Record Format
Each 80-byte record:
| Bytes | Content |
|---|---|
| 0 | 0x02 (record type marker) |
| 1–3 | TXT in EBCDIC (0xE3 0xE7 0xE3) or END (0xC5 0xD5 0xC4) |
| 4 | 0x00 |
| 5–7 | Load address (24-bit, big-endian) |
| 8–9 | 0x00 0x00 |
| 10–11 | Byte count (0x0038 = 56, big-endian) |
| 12–15 | 0x00 0x00 0x00 0x00 |
| 16–71 | 56 bytes of binary data |
| 72–79 | 0x00 × 8 |
The tool reads 56 bytes at a time from the input binary, wraps each chunk in a TXT record, and writes an END record at the end.
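The record layout in the table can be expressed as a short C routine. This is a sketch derived from the table, not the code in tools/bin2rec.c; rec_make_txt and rec_count are hypothetical names, and the handling of a short final chunk (zero padding with the count field left at 56) is an assumption, since the text does not describe it.

```c
#include <stdint.h>
#include <string.h>

#define REC_SIZE     80u
#define REC_DATA_OFF 16u
#define REC_DATA_MAX 56u

/* Number of TXT records needed for input_len bytes (END record not included). */
static size_t rec_count(size_t input_len)
{
    return (input_len + REC_DATA_MAX - 1u) / REC_DATA_MAX;
}

/*
 * Fill one 80-byte TXT record.
 * len is capped at 56; a shorter chunk is zero-padded and the count
 * field still holds 56 (assumption, see the lead-in).
 */
static void rec_make_txt(uint8_t rec[REC_SIZE], uint32_t load_addr,
                         const uint8_t *data, size_t len)
{
    static const uint8_t txt_ebcdic[3] = { 0xE3, 0xE7, 0xE3 };  /* "TXT" in EBCDIC */

    if (len > REC_DATA_MAX)
        len = REC_DATA_MAX;

    memset(rec, 0, REC_SIZE);                /* bytes 4, 8-9, 12-15, 72-79 stay zero */
    rec[0] = 0x02;                           /* record type marker */
    memcpy(&rec[1], txt_ebcdic, sizeof txt_ebcdic);
    rec[5] = (uint8_t)(load_addr >> 16);     /* 24-bit big-endian load address */
    rec[6] = (uint8_t)(load_addr >> 8);
    rec[7] = (uint8_t)load_addr;
    rec[10] = 0x00;                          /* byte count, big-endian: 0x0038 */
    rec[11] = (uint8_t)REC_DATA_MAX;
    memcpy(&rec[REC_DATA_OFF], data, len);   /* bytes 16-71 */
}
```

A caller would loop over the input in 56-byte steps, advancing load_addr by 56 each time, and then emit one END record. An input of exactly 32768 bytes needs rec_count(32768) = 586 TXT records.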
4. Limitations
- Maximum input size: 32 KB (MAX_REC_SIZE = 32768). Because bin2rec is applied to zxfl_stage1.bin, this effectively caps the Stage 0 loader image at 32 KB.
- Load address is 24-bit; this is intentional. The IPL PSW is a 31-bit ESA/390 PSW; the channel subsystem loads the record into the low 16 MB.
zxsign
Document Revision: 26h1.0
Source: tools/zxsign.c
1. Purpose
zxsign is a post-build host tool that computes SHA-256 digests for each PT_LOAD segment of the kernel ELF and patches them into the .zxvl_checksums section in-place.
zxsign <core.zxfoundation.nucleus>
The file is modified in place. It must be a valid ELF64 file with a .zxvl_checksums section.
2. Operation
- Read and validate the ELF64 header (magic, EI_CLASS = ELFCLASS64).
- Locate .zxvl_checksums by walking the section header table and the section name string table.
- Collect all PT_LOAD program headers. Skip segments with p_filesz = 0 and the segment containing .zxvl_checksums itself (hashing the table while building it would be circular).
- For each remaining PT_LOAD segment, read p_filesz bytes from p_offset and compute SHA-256.
- Build a zxvl_checksum_table_t with magic 0x5A58564C, version 1, algorithm ZXVL_CKSUM_ALGO_SHA256, and one entry per segment. Physical addresses are computed by stripping CONFIG_KERNEL_VIRT_OFFSET from p_paddr.
- Seek to the file offset of .zxvl_checksums and write the complete table in one fwrite.
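The segment selection rule in the third step can be sketched as follows. The reduced header type demo_phdr_t and both function names are hypothetical, and testing containment by file offset is an assumption about how zxsign identifies the segment that holds .zxvl_checksums.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define DEMO_PT_LOAD 1u   /* PT_LOAD value from the ELF specification */

/* Reduced program header: only the fields the selection rule needs. */
typedef struct {
    uint32_t p_type;
    uint64_t p_offset;
    uint64_t p_filesz;
} demo_phdr_t;

/* True if the segment should be hashed; cksum_off is the file offset of .zxvl_checksums. */
static bool demo_should_hash(const demo_phdr_t *ph, uint64_t cksum_off)
{
    if (ph->p_type != DEMO_PT_LOAD || ph->p_filesz == 0)
        return false;
    /* Skip the segment whose file range contains the checksum section. */
    if (cksum_off >= ph->p_offset && cksum_off - ph->p_offset < ph->p_filesz)
        return false;
    return true;
}

/* Write the indices of segments to hash into out[]; returns how many were selected. */
static size_t demo_select_segments(const demo_phdr_t *ph, size_t n,
                                   uint64_t cksum_off,
                                   size_t *out, size_t out_cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n && count < out_cap; i++) {
        if (demo_should_hash(&ph[i], cksum_off))
            out[count++] = i;
    }
    return count;
}
```

Each selected index would then be passed to the SHA-256 step, which reads p_filesz bytes starting at p_offset.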
3. Checksum Table Layout
zxvl_checksum_table_t (packed):
    uint32_t magic;        // 0x5A58564C
    uint32_t version;      // 0x00000001
    uint32_t algo;         // 0x00000001 (SHA-256)
    uint32_t count;        // number of entries
    entries[16]:
        uint64_t phys_start;   // physical address of segment
        uint64_t size;         // p_filesz
        uint8_t  digest[32];   // SHA-256
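Written as C declarations, the layout above might look like the sketch below. The field names and constants come from the listing; the entry type name zxvl_checksum_entry_t, the ZXVL_CKSUM_MAGIC and ZXVL_CKSUM_MAX_ENTRIES macros, and the validation helper are assumptions added for illustration. The static assertions confirm that the packed table occupies 16 + 16 × 48 = 784 bytes.

```c
#include <stdint.h>
#include <stddef.h>

#define ZXVL_CKSUM_MAGIC       0x5A58564Cu   /* "ZXVL" in ASCII */
#define ZXVL_CKSUM_ALGO_SHA256 1u
#define ZXVL_CKSUM_MAX_ENTRIES 16u           /* hypothetical name for the entries[16] bound */

typedef struct __attribute__((packed)) {
    uint64_t phys_start;                     /* physical address of segment */
    uint64_t size;                           /* p_filesz */
    uint8_t  digest[32];                     /* SHA-256 */
} zxvl_checksum_entry_t;                     /* hypothetical type name */

typedef struct __attribute__((packed)) {
    uint32_t magic;
    uint32_t version;
    uint32_t algo;
    uint32_t count;
    zxvl_checksum_entry_t entries[ZXVL_CKSUM_MAX_ENTRIES];
} zxvl_checksum_table_t;

_Static_assert(sizeof(zxvl_checksum_entry_t) == 48, "entry is 48 bytes");
_Static_assert(sizeof(zxvl_checksum_table_t) == 784, "table is 784 bytes");
_Static_assert(offsetof(zxvl_checksum_table_t, entries) == 16,
               "entries follow the 16-byte header");

/* Basic header sanity check, comparing fields in host byte order. */
static int zxvl_table_header_ok(const zxvl_checksum_table_t *t)
{
    return t->magic == ZXVL_CKSUM_MAGIC &&
           t->version == 1u &&
           t->algo == ZXVL_CKSUM_ALGO_SHA256 &&
           t->count <= ZXVL_CKSUM_MAX_ENTRIES;
}
```

Because s390x is big-endian, a host tool running on a little-endian machine would need to byte-swap the integer fields before writing the table; the sketch ignores byte order.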
The table is located at load_min + ZXVL_CKSUM_TABLE_OFFSET (0x80000) in the loaded kernel. The bootloader reads it from physical memory after loading all ELF segments.
4. Kernel Requirements
The kernel must define a .zxvl_checksums section anchored at the correct virtual address:
__attribute__((section(".zxvl_checksums"), used)) /* used: keeps the otherwise unreferenced table from being discarded */
static volatile zxvl_checksum_table_t zxvl_cksum_table = { 0 };
The linker script must place .zxvl_checksums at HHDM_BASE + 0x80000