Per-CPU Data

Document Revision: 26h1.3
Sources: include/arch/s390x/cpu/lowcore.h, include/zxfoundation/percpu.h, arch/s390x/cpu/percpu.c


1. Layout

Each CPU's prefix area (lowcore) is a single contiguous 8 KB block (two adjacent physical pages). The physical address of this block is loaded into the hardware prefix register via SPX. From then on, the hardware transparently redirects every access to real addresses 0x0000–0x1FFF to the CPU's own lowcore, and redirects accesses to the lowcore's own address range back to absolute 0x0000–0x1FFF.

The layout unifies hardware-assigned fields and software-defined per-CPU data into a single structure (zx_lowcore_t):

Physical Prefix Area (8 KB)
┌──────────────────────────────┐ 0x000
│  Hardware Lowcore            │   PSWs, interrupt codes, save areas (PoP §4)
├──────────────────────────────┤ 0x400  ← LC_PERCPU_OFFSET
│  Software Per-CPU Block      │   prefix_base, cpu_id, lock_depth,
│  (zx_percpu_t percpu)        │   MCS nodes, RCU state, PCP caches
├──────────────────────────────┤ 0x1200
│  Hardware Save Areas         │   GPRs, FPRs, CRs, ARs
└──────────────────────────────┘ 0x2000

2. Access — Current CPU

To access the current CPU's own per-CPU data, the kernel uses zx_lowcore(), which returns the HHDM-mapped pointer to the active lowcore. Because the prefix register already routes real address 0 to this CPU's physical lowcore, and the HHDM maps physical 0 to CONFIG_KERNEL_VIRT_OFFSET, zx_lowcore() always resolves to the current CPU's lowcore without ever reading the prefix register value.

Macro                   Description
percpu_get(field)       Read a field from the current CPU's percpu block
percpu_set(field, val)  Write a field to the current CPU's percpu block
percpu_inc(field)       Increment a field in place
percpu_dec(field)       Decrement a field in place
percpu_ptr_to(field)    Pointer to a field in the current CPU's block
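
As an illustration of how the current-CPU macros could be layered on zx_lowcore(), here is a minimal host-side sketch. The macro bodies are an assumption, not the definitions from include/zxfoundation/percpu.h; the two structs are trimmed, hypothetical stand-ins, and fake_lowcore replaces the HHDM pointer that the kernel's zx_lowcore() would return. lock_depth_demo() is a hypothetical helper added only to exercise the macros.

```c
#include <assert.h>
#include <stdint.h>

/* Trimmed, hypothetical stand-ins for the real types. */
typedef struct {
    uint64_t prefix_base;
    uint16_t cpu_id;
    uint32_t lock_depth;
} zx_percpu_t;

typedef struct {
    zx_percpu_t percpu;
} zx_lowcore_t;

/* Host stand-in: the kernel would return the HHDM pointer to the active lowcore. */
static zx_lowcore_t fake_lowcore;

static inline zx_lowcore_t *zx_lowcore(void)
{
    return &fake_lowcore;
}

/* Assumed macro shapes: each one is a thin wrapper around zx_lowcore()->percpu. */
#define percpu_get(field)      (zx_lowcore()->percpu.field)
#define percpu_set(field, val) ((void)(zx_lowcore()->percpu.field = (val)))
#define percpu_inc(field)      ((void)(zx_lowcore()->percpu.field++))
#define percpu_dec(field)      ((void)(zx_lowcore()->percpu.field--))
#define percpu_ptr_to(field)   (&zx_lowcore()->percpu.field)

/* Two nested acquisitions followed by one release leave a depth of 1. */
static uint32_t lock_depth_demo(void)
{
    percpu_set(lock_depth, 0);
    percpu_inc(lock_depth);
    percpu_inc(lock_depth);
    assert(percpu_get(lock_depth) > 0);  /* guard against an unbalanced release */
    percpu_dec(lock_depth);
    return percpu_get(lock_depth);
}
```

Because every macro expands to a plain member access through zx_lowcore(), none of them needs a CPU index or the prefix register value.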

3. Access — Other CPUs (zx_lowcore_cpu)

3.1 The Hardware Prefix Aliasing Bug

Accessing another CPU's lowcore by index into a global pointer array is deceptively dangerous on s390x. Consider the global array __percpu_areas_raw[] where:

  • __percpu_areas_raw[0] = HHDM pointer to BSP lowcore = CONFIG_KERNEL_VIRT_OFFSET + 0
  • __percpu_areas_raw[1] = HHDM pointer to AP-1 lowcore = CONFIG_KERNEL_VIRT_OFFSET + P

When AP-1 (whose prefix register is P) reads a value from address CONFIG_KERNEL_VIRT_OFFSET + 0 (i.e., the BSP's HHDM lowcore), the MMU translates it to physical address 0. The prefix register then remaps physical 0 to physical P — so AP-1 silently reads its own lowcore, not the BSP's.

Symmetrically, when AP-1 reads from CONFIG_KERNEL_VIRT_OFFSET + P, the MMU translates it to physical P. The prefix register remaps physical P to physical 0 — so AP-1 silently reads the BSP's lowcore.

The result: on every AP, the array entries for the BSP and for the AP itself are silently swapped. A lookup of the BSP returns the AP's own lowcore, and a lookup of the AP's own entry returns the BSP's. With direct array access, IPI delivery, RCU quiescent-state tracking, and the PMM per-CPU page caches all operated on the wrong CPU's data. The system "mostly worked" because the symmetry of the swap meant IPIs still reached every CPU, which masked the corruption.
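
The swap can be reproduced on a host with a small address-only model. This is a sketch under stated assumptions: HHDM_BASE is an arbitrary stand-in for CONFIG_KERNEL_VIRT_OFFSET, the three lowcore addresses are invented, and prefix_apply() and naive_lookup_owner() are hypothetical helpers that model the hardware behaviour described above rather than kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define LC_SIZE   0x2000u                 /* 8 KB prefix area */
#define NR_CPUS   3
#define HHDM_BASE 0xFFFF800000000000ull   /* stand-in for CONFIG_KERNEL_VIRT_OFFSET */

/* Invented physical lowcore addresses: CPU 0 is the BSP at physical 0. */
static const uint64_t lowcore_abs[NR_CPUS] = { 0x0000, 0x2000, 0x4000 };

/* Model of prefixing: low 8 KB and the prefix area trade places. */
static uint64_t prefix_apply(uint64_t real, uint64_t prefix)
{
    if (real < LC_SIZE)
        return real + prefix;             /* 0x0000-0x1FFF -> own lowcore */
    if (real - prefix < LC_SIZE)          /* unsigned wrap makes real < prefix fail */
        return real - prefix;             /* own lowcore range -> 0x0000-0x1FFF */
    return real;                          /* everything else is untouched */
}

/* Which CPU's lowcore does `reader` hit when it dereferences raw[target]? */
static int naive_lookup_owner(int reader, int target)
{
    assert(reader >= 0 && reader < NR_CPUS);
    assert(target >= 0 && target < NR_CPUS);

    uint64_t raw_ptr = HHDM_BASE + lowcore_abs[target];  /* __percpu_areas_raw[target] */
    uint64_t real    = raw_ptr - HHDM_BASE;              /* MMU: virtual -> physical */
    uint64_t abs     = prefix_apply(real, lowcore_abs[reader]);

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (abs - lowcore_abs[cpu] < LC_SIZE)
            return cpu;
    return -1;
}
```

In this model, naive_lookup_owner(1, 0) yields 1 and naive_lookup_owner(1, 1) yields 0, while naive_lookup_owner(1, 2) yields 2: only the BSP entry and the reader's own entry are exchanged.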

3.2 The Safe Accessor: zx_lowcore_cpu(cpu)

__percpu_areas_raw[] must never be accessed directly. Use zx_lowcore_cpu(cpu), defined in include/zxfoundation/percpu.h, which applies an inverse prefix swap in software:

#define zx_lowcore_cpu(cpu)                                                    \
    ({                                                                          \
        zx_lowcore_t *__lc = __percpu_areas_raw[(cpu)];                        \
        zx_lowcore_t *__res = __lc;                                             \
        if (__lc) {                                                             \
            uint64_t __target_real = (uint64_t)__lc - CONFIG_KERNEL_VIRT_OFFSET;\
            uint64_t __my_prefix   = zx_lowcore()->percpu.prefix_base;         \
            if (__target_real == __my_prefix)                                   \
                __res = (zx_lowcore_t *)CONFIG_KERNEL_VIRT_OFFSET;             \
            else if (__target_real == 0)                                        \
                __res = (zx_lowcore_t *)(CONFIG_KERNEL_VIRT_OFFSET + __my_prefix);\
        }                                                                       \
        __res;                                                                  \
    })

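How it works: if the target's physical address equals my_prefix (the target is the calling CPU itself), dereferencing HHDM + my_prefix would be swapped by the hardware to absolute 0, the BSP's lowcore. The accessor therefore returns HHDM + 0 instead, which the hardware routes to the caller's own lowcore. If the target's physical address is 0 (the BSP), dereferencing HHDM + 0 would be swapped to my_prefix, so the accessor returns HHDM + my_prefix, which the hardware routes back to absolute 0. Any other CPU's lowcore lies outside both swapped ranges and is returned unchanged. On the BSP itself, my_prefix is 0, both cases collapse to HHDM + 0, and no redirection takes place.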
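
The inverse swap can be checked exhaustively with a small host-side model. This is a sketch, not the kernel code: safe_real() restates the macro's two branches as a function on physical addresses (the HHDM offset is omitted because it cancels out), and prefix_apply() and accessor_roundtrips() are hypothetical helpers that model the hardware swap and run the check.

```c
#include <assert.h>
#include <stdint.h>

#define LC_SIZE 0x2000u   /* 8 KB prefix area */

/* Model of prefixing: low 8 KB and the prefix area trade places. */
static uint64_t prefix_apply(uint64_t real, uint64_t prefix)
{
    if (real < LC_SIZE)
        return real + prefix;
    if (real - prefix < LC_SIZE)   /* unsigned wrap makes real < prefix fail */
        return real - prefix;
    return real;
}

/* Address the accessor would dereference, mirroring the macro's branches. */
static uint64_t safe_real(uint64_t target_real, uint64_t my_prefix)
{
    if (target_real == my_prefix)
        return 0;                  /* self: go through real 0 */
    if (target_real == 0)
        return my_prefix;          /* BSP: go through the caller's prefix range */
    return target_real;            /* any other CPU: no swap applies */
}

/* Returns 1 if every reader reaches every target's own lowcore. */
static int accessor_roundtrips(const uint64_t *lowcores, int n)
{
    assert(n > 0);
    for (int r = 0; r < n; r++)
        for (int t = 0; t < n; t++)
            if (prefix_apply(safe_real(lowcores[t], lowcores[r]),
                             lowcores[r]) != lowcores[t])
                return 0;
    return 1;
}
```

Called with a table such as { 0x0, 0x2000, 0x4000, 0x10000 }, accessor_roundtrips() returns 1: after the software redirection and the hardware swap, each reader lands on the lowcore it asked for, including the case where the reader is the BSP.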

The cross-CPU access macros all go through this accessor:

Macro                           Description
percpu_get_on(cpu, field)       Read from another CPU's percpu block
percpu_set_on(cpu, field, val)  Write to another CPU's percpu block
percpu_ptr_on(cpu, field)       Pointer to a field in another CPU's block

4. Initialization

Function                                When Called                Effect
percpu_init_bsp()                       Once, early in main.c      Registers BSP lowcore (physical 0x0) in __percpu_areas_raw[0]
percpu_init_ap(cpu_id, cpu_addr, node)  Once per AP in smp_init()  Allocates 8 KB (order-1), installs prefix via SPX, registers in __percpu_areas_raw[cpu_id]

5. Fields (zx_percpu_t)

Field                       Type             Purpose
prefix_base                 uint64_t         Physical address of this CPU's lowcore (used by zx_lowcore_cpu)
cpu_id                      uint16_t         Logical CPU ID (0 = BSP)
cpu_addr                    uint16_t         z/Arch CPU address (STAP result); used for SIGP
lock_depth                  uint32_t         qspinlock nesting depth
lock_nodes[MAX_LOCK_DEPTH]  mcs_node_t[]     MCS queue nodes for qspinlock
rcu_gp_seq                  uint64_t         RCU grace-period sequence (written by BSP)
rcu_qs_seq                  uint64_t         RCU quiescent-state sequence (written by this CPU)
in_rcu_read_side            uint8_t          1 if inside rcu_read_lock()
ipi_pending_count           uint32_t         Pending IPI completion counter
ap_stack_top                uint64_t         Initial AP stack pointer (physical, set before SIGP Restart)
pcp[ZONE_MAX]               pmm_pcplist_t[]  Per-CPU PMM order-0 page caches, one per memory zone

6. Assembly Offsets

Key lowcore offsets used by entry.S and head64.S are defined as named constants in include/arch/s390x/cpu/lowcore.h and verified at compile time by _Static_assert:

Constant          Value   Field
LC_ASYNC_STACK    0x0350  zx_lowcore_t::async_stack
LC_MCCK_STACK     0x0368  zx_lowcore_t::mcck_stack
LC_KERNEL_STACK   0x0348  zx_lowcore_t::kernel_stack
LC_RESTART_STACK  0x0360  zx_lowcore_t::restart_stack
LC_KERNEL_ASCE    0x0388  zx_lowcore_t::kernel_asce
LC_PERCPU_OFFSET  0x0400  zx_lowcore_t::percpu
LC_CPU_ID_OFFSET  0x0408  zx_percpu_t::cpu_id (within percpu block)
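
A compile-time check of this kind might look like the sketch below. The struct bodies are hypothetical: only the offsets listed above, the 0x1200 and 0x2000 boundaries from section 1, and the first three percpu fields come from this document. The padding arrays, the elided remainder of zx_percpu_t, and the assumption that prefix_base, cpu_id, and cpu_addr appear in that order are placeholders, not the real lowcore.h layout.

```c
#include <assert.h>   /* C11: static_assert expands to _Static_assert */
#include <stddef.h>
#include <stdint.h>

#define LC_KERNEL_STACK   0x0348
#define LC_ASYNC_STACK    0x0350
#define LC_RESTART_STACK  0x0360
#define LC_MCCK_STACK     0x0368
#define LC_KERNEL_ASCE    0x0388
#define LC_PERCPU_OFFSET  0x0400
#define LC_CPU_ID_OFFSET  0x0408

typedef struct {
    uint64_t prefix_base;        /* percpu + 0x00 */
    uint16_t cpu_id;             /* percpu + 0x08 */
    uint16_t cpu_addr;
    uint8_t  rest[0xE00 - 12];   /* remaining fields elided; fills up to 0x1200 */
} zx_percpu_t;

typedef struct {
    uint8_t     hw_low[0x348];   /* PSWs, interrupt codes, ... (layout elided) */
    uint64_t    kernel_stack;
    uint64_t    async_stack;
    uint8_t     pad_358[0x08];   /* placeholder padding */
    uint64_t    restart_stack;
    uint64_t    mcck_stack;
    uint8_t     pad_370[0x18];   /* placeholder padding */
    uint64_t    kernel_asce;
    uint8_t     pad_390[0x70];   /* placeholder padding */
    zx_percpu_t percpu;
    uint8_t     hw_save[0xE00];  /* 0x1200: GPRs, FPRs, CRs, ARs */
} zx_lowcore_t;

static_assert(offsetof(zx_lowcore_t, kernel_stack)  == LC_KERNEL_STACK,  "kernel_stack");
static_assert(offsetof(zx_lowcore_t, async_stack)   == LC_ASYNC_STACK,   "async_stack");
static_assert(offsetof(zx_lowcore_t, restart_stack) == LC_RESTART_STACK, "restart_stack");
static_assert(offsetof(zx_lowcore_t, mcck_stack)    == LC_MCCK_STACK,    "mcck_stack");
static_assert(offsetof(zx_lowcore_t, kernel_asce)   == LC_KERNEL_ASCE,   "kernel_asce");
static_assert(offsetof(zx_lowcore_t, percpu)        == LC_PERCPU_OFFSET, "percpu");
static_assert(offsetof(zx_lowcore_t, percpu) + offsetof(zx_percpu_t, cpu_id)
              == LC_CPU_ID_OFFSET, "cpu_id");
static_assert(sizeof(zx_lowcore_t) == 0x2000, "lowcore must be exactly 8 KB");
```

With checks like these, a field added or resized in C breaks the build instead of silently shifting an offset that entry.S or head64.S depends on. Note that in this sketch 0x0408 is measured from the start of the lowcore, that is, LC_PERCPU_OFFSET plus the position of cpu_id inside the percpu block.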