vmm.dev

concurrency control 1

concurren...concurrency control 1xv6xv6krabskrabsososvmm.devvmm.dev

concurrency control 1

Related: xv6, krabs.

Most operating systems, including the teaching OS that this note came from, run code concurrently. The kernel is one no-std program image, and every hart can enter the same kernel functions and touch the same static data. Code can run at the same time on multiple CPUs, and code can also be interrupted and resumed through interrupt handling and process switching.

There is no separate runtime object that owns kernel execution in this model. A process is scheduler state, not a separate copy of kernel code. An interrupt or trap is another entry path on a CPU. The synchronization problem is that all cores can run the same kernel code against shared memory.

On an SMP machine, all CPUs share memory and I/O. Concurrent code therefore often touches the same data. If multiple CPUs access a shared resource without coordination, they can corrupt that data or observe states that no single sequential execution would have produced.

shared bus shared memory CPU 1 cache CPU 2 cache CPU 3 cache CPU 4 cache

This class of bug is called a race condition. The part of the program that can cause the race is a critical section. The shared resource manipulated by that critical section is a critical resource.

The problem is not simply that two CPUs run at the same time. The problem is that reads and writes against a critical resource can interleave in a way that the programmer did not intend. The usual fix is to protect the critical resource: only one critical section that manipulates that resource may execute at a time.

An operating system implementer therefore needs mechanisms that provide the following properties:

  • Critical sections coordinate with each other, usually by synchronization that implements mutual exclusion.
  • Access to a critical resource does not wait forever.

Synchronization primitives arbitrate access to resources. Mutual exclusion is built on top of those primitives.

There are also different time scales for mutual exclusion. Short-term mutual exclusion protects short critical sections with low-level primitives such as spinlocks. Long-term mutual exclusion coordinates higher-level activity, usually across processes or blocking waits. This note starts from the low-level side and builds a short-term mutual exclusion primitive, a spinlock, from a Rust point of view. It also uses the same reasoning to explain related types that derive safety from the same mutual-exclusion ideas.

Later, higher-level process synchronization and longer-term resource allocation policies can be built on top of these foundations.

spinlocks

A spinlock is a short-term mutual exclusion mechanism used to prevent race conditions in short critical sections. It uses a loop and an atomic CPU instruction to ensure that at most one CPU path enters the critical section at a time.

In Rust, the interface can look like this. Here the spinlock is exposed as a Mutex, for "mutual exclusion".

let m: Mutex<usize> = Mutex::new(5);
{
    let mut val = m.lock(); // acquire the lock
    // critical section
    *val = 6;
} // release the lock

In this example, Mutex wraps a usize critical resource. The wrapped data cannot be accessed unless lock() has acquired the lock. If another CPU has already acquired the lock, lock() loops until the lock becomes available. The name "spin" comes from this behavior: while waiting, the CPU spins in a loop. When the acquired guard val leaves scope, drop() runs automatically and releases the lock.

This Rust shape is much better than the conventional C shape. In C, the lock state is usually stored separately from the protected data, and explicit acquire() and release() calls are inserted around the critical section.

struct lock vallock;
int val = 5;
acquire(&vallock);
// critical section
val = 6;
release(&vallock);

In C, it is easy to forget either acquire() or release(). It is also possible to access val without respecting the lock at all. In the Rust design, the type system ties access to the protected data to the acquired guard, and automatic destruction releases the lock. This removes a large class of manual locking mistakes.

The next question is how a spinlock is built. The short answer is: with help from hardware atomic instructions.

atomic memory operations

When multiple CPUs access the same memory location at the same time, the relative order of their memory operations is not automatically the order a programmer wants. If one CPU writes a memory location while another CPU reads it, and no synchronization exists, the program cannot rely on a useful ordering between those two operations. If multiple CPUs update shared data this way, timing can corrupt the data.

Consider a simple increment of a memory location named counter. The following pseudo-assembly loads counter, adds one, and stores the result:

ld   a0, counter
addi a0, a0, 1
sd   a0, counter

Now suppose counter starts at 0, and two CPUs run this sequence concurrently. The intended final value is 2.

counter starts at 0 P1 on CPU1 P2 on CPU2 read read write write

The outcome can still be 1:

timeCPU1 instructionCPU1 a0counterCPU2 instructionCPU2 a0
1ld a0, counter00ld a0, counter0
2addi a0, a0, 110addi a0, a0, 11
3sd a0, counter11sd a0, counter1

Each CPU has its own register set, so the a0 register on CPU1 and the a0 register on CPU2 are independent. The memory, however, is shared. Both CPUs read the old value, both increment their private copy, and both write 1. One increment is lost.

This is the race condition. The three-instruction sequence is the critical section, and counter is the critical resource.

To prevent this, shared-memory access must be synchronized. Most CPUs provide atomic read-modify-write instructions for exactly this purpose.

An atomic operation appears indivisible to the other CPUs participating in the memory system. In an atomic read-modify-write operation, the CPU reads a value from memory, computes a new value, and writes it back to the same memory location as one atomic operation. Other CPUs observe a coherent order for that operation; they do not observe the read and write as independently interleavable pieces.

RISC-V provides atomic memory operations through the A extension: AMO instructions, and LR/SC instructions. LR/SC means load-reserved and store-conditional. This note mostly uses AMO instructions because they are simpler to explain.

For example, a RISC-V amoadd.w can atomically add one to a word in memory. In real RISC-V assembly the memory address is held in a register, so the shape is closer to this:

# a1 holds 1
# a2 holds the address of counter
amoadd.w a0, a1, (a2)

The instruction reads the old value at counter into a0, adds a1 to the memory value, and writes the new value back. The read-modify-write operation is atomic. This does not mean the CPU globally disables interrupts, and it does not mean every other memory operation in the machine stops. It means this particular memory operation is observed as one atomic operation for that address and ordering domain.

If CPU1 and CPU2 both run the atomic increment, the final counter value is 2. Which CPU runs first is not defined, but the two atomic operations are not lost into each other:

timeCPU1 instructionCPU1 a0counterCPU2 instructionCPU2 a0
1amoadd.w a0, a1, (a2)010
202amoadd.w a0, a1, (a2)1

An operating system can use these primitive atomic read-modify-write operations to build stronger synchronization mechanisms.

A spinlock can be implemented with instructions such as amoswap or with an LR/SC loop. amoswap atomically swaps the value in a register with the value at a memory address:

# a0 holds the address of LOCK
amoswap.w t0, t0, (a0)

The old value stored at LOCK is loaded into t0, and the previous value of t0 is written to LOCK. If this were written as an ordinary load followed by an ordinary store, another CPU could access the same location between the two operations.

With amoswap, a simple spinlock can be sketched like this:

    la  a0, LOCK
    li  t0, 1
lock_retry:
    amoswap.w t1, t0, (a0)
    bnez t1, lock_retry
    # ...
    # critical section
    # ...
    amoswap.w x0, x0, (a0)

Here LOCK stores the lock state. 0 means unlocked, and 1 means locked.

  1. t0 is initialized to 1.
  2. amoswap.w stores the old value of LOCK in t1 and writes 1 to LOCK.
  3. If the old value was 0, the lock was acquired. If the old value was not 0, another CPU already held the lock, so the code retries.
  4. Unlocking writes 0 back to LOCK. x0 is the RISC-V zero register, so amoswap.w x0, x0, (a0) writes zero and discards the old value.

This prevents two CPUs from passing the lock acquisition point at the same time. But it is still incomplete. It handles atomicity for the lock variable, but it does not yet handle memory ordering for the critical section itself.

memory models

Memory accesses are not always observed in the exact order written in the program. CPUs, caches, store buffers, interconnects, and compilers all exist to make memory systems fast, and performance often depends on allowing some reordering.

In SMP, memory is globally reachable, but each CPU may issue and observe memory operations through local pipelines, buffers, and caches. A memory model defines which reorderings are allowed and which orderings must be preserved.

RISC-V uses RVWMO, the RISC-V weak memory ordering model.

RVWMO is usually discussed with three orders:

  • Program order: the order of memory operations in the instruction stream of one CPU.
  • Global memory order: the single global order in which memory operations are considered to occur for the memory model.
  • Preserved program order: the subset of program order that must also be preserved in global memory order.

The following diagram is an abstraction. CPU-local program order is drawn on the sides. Global memory order is drawn in the center. Only the preserved subset is guaranteed to keep the same relative order globally.

CPU1 program order global memory order CPU2 program order P1-a P1-b P1-c P1-d P1-e P1-f P2-a P2-b P2-c P2-d P2-e P2-f P1-b P2-a P1-a P2-c P1-c P1-d P1-e P1-f PPO must remain ordered globally.

In the diagram, the order P1-c -> P1-d -> P1-e -> P1-f is preserved in global memory order because it belongs to the preserved program order. Other operations may appear in a different global order than their local program order. The diagram does not show all CPU2 operations in global memory order because its purpose is only to show the distinction.

For this OS-level Rust discussion, the important part of preserved program order is explicit synchronization: using barriers or acquire/release annotations to force the relevant order.

One memory barrier is the RISC-V fence instruction. fence constrains memory accesses before the fence, called predecessors, with respect to memory accesses after the fence, called successors. The predecessor and successor sets specify whether reads, writes, I/O, or combinations are ordered. For example, fence rw, rw, fence r, rw, and fence rw, w express different ordering constraints.

RISC-V atomic instructions also have two ordering bits: aq and rl.

  • If neither bit is set, the atomic instruction has no extra ordering constraint beyond atomicity for that operation.
  • If only aq is set, the atomic operation has acquire semantics: later memory operations in program order cannot be globally observed before the acquire operation.
  • If only rl is set, the atomic operation has release semantics: earlier memory operations in program order must be globally observed before the release operation.
  • If both aq and rl are set, the operation has both acquire and release semantics.

Now return to the earlier spinlock:

    la  a0, LOCK
    li  t0, 1
lock_retry:
    amoswap.w t1, t0, (a0)
    bnez t1, lock_retry
    # ...
    # critical section
    # ...
    amoswap.w x0, x0, (a0)

This code has no ordering constraint around the critical section. Loads and stores in the critical section may be reordered, as observed globally, in ways that defeat the intended lock discipline. The lock word itself is updated atomically, but the protected data is not necessarily ordered with respect to lock acquisition and release.

The AMO ordering bits fix that:

    sd x1, MEM1
    ld x2, MEM2
    la a0, LOCK
    li t0, 1
lock_retry:
    amoswap.w.aq t1, t0, (a0)
    bnez t1, lock_retry
    # ...
    # critical section
    # ...
    amoswap.w.rl x0, x0, (a0)
    sd x3, MEM3
    ld x4, MEM4

The aq bit on the lock acquisition prevents memory operations inside the critical section from being globally ordered before the acquire. The rl bit on the unlock prevents memory operations inside the critical section from being globally ordered after the release.

The added sd x1, MEM1, ld x2, MEM2, sd x3, MEM3, and ld x4, MEM4 are outside the critical section. They are shown only to make the direction of the acquire and release constraints explicit.

With these constraints, mutual exclusion protects not only the lock variable but also the memory operations inside the critical section.

The same style can be expressed with fence instructions:

    sd x1, MEM1
    ld x2, MEM2
    la a0, LOCK
    li t0, 1
lock_retry:
    amoswap.w t1, t0, (a0)
    bnez t1, lock_retry
    fence r, rw
    # ...
    # critical section
    # ...
    fence rw, w
    amoswap.w x0, x0, (a0)
    sd x3, MEM3
    ld x4, MEM4

For the acquire side, the fence belongs after a successful acquisition, not before the branch that tests whether acquisition succeeded. Putting it after the branch avoids paying the acquire fence on failed attempts and avoids describing a failed exchange as if it had acquired the lock.

fence r, rw is a conventional way to express acquire-like ordering. fence rw, w is a conventional way to express release-like ordering. These fences are usually coarser than the AMO aq and rl bits, because they order broader classes of operations.

In practice, OS code should not hand-write assembly every time it needs a lock. Rust exposes atomic operations and memory ordering through core::sync::atomic.

Rust atomics and ordering

Rust provides atomic versions of several primitive types, such as AtomicUsize, AtomicBool, and AtomicPtr. Atomic methods can safely mutate through shared references such as &AtomicBool, because the mutation is performed atomically.

The assembly examples used a LOCK variable that holds 0 or 1. In Rust, the simplest representation is AtomicBool. Hardware instructions such as amoswap, amoor, or LR/SC loops are compiler targets for atomic operations. The exact instruction sequence is a compiler and target decision; the Rust program specifies semantics, not a specific opcode.

Memory ordering is expressed with the Ordering enum. Important variants include Relaxed, Acquire, Release, AcqRel, and SeqCst.

Atomic operations receive an Ordering argument, and core::sync::atomic::fence() receives an Ordering argument for fences. On RISC-V, a typical lowering for read-modify-write operations has this shape, although exact output may vary by compiler, target, and operation:

Rust operation shapeRISC-V shape
Atomic*::method(..., Ordering::Relaxed)amo<op>.{w,d} or an LR/SC loop
Atomic*::method(..., Ordering::Acquire)amo<op>.{w,d}.aq or equivalent
Atomic*::method(..., Ordering::Release)amo<op>.{w,d}.rl or equivalent
Atomic*::method(..., Ordering::AcqRel)amo<op>.{w,d}.aqrl or equivalent
Atomic*::method(..., Ordering::SeqCst)at least acquire-release semantics, plus whatever the compiler emits to satisfy Rust's sequential consistency requirements

Similarly, fences are commonly lowered like this on RISC-V:

Rust fenceRISC-V shape
fence(Ordering::Acquire)fence r, rw
fence(Ordering::Release)fence rw, w
fence(Ordering::SeqCst)fence rw, rw or a stronger compiler-selected sequence

The following Rust code demonstrates the intended relationship between Rust atomics and RISC-V ordering annotations. The emitted assembly is only an example; a different compiler version may choose LR/SC or a different AMO form.

#![no_std]
#![no_main]

use core::sync::atomic::*;

#[no_mangle]
pub extern "C" fn _start() {
    let atomic_bool = AtomicBool::new(false);
    atomic_bool.swap(true, Ordering::Relaxed);
    fence(Ordering::Acquire);
    atomic_bool.swap(true, Ordering::Acquire);
    fence(Ordering::Release);
    atomic_bool.swap(true, Ordering::Release);
    fence(Ordering::SeqCst);
    atomic_bool.swap(true, Ordering::AcqRel);
    atomic_bool.swap(true, Ordering::SeqCst);
}

One possible RISC-V output for a similar build is:

0000000000011120 <_start>:
   11120: 1141        addi    sp,sp,-16
   11122: 00010423    sb      zero,8(sp)
   11126: 4505        li      a0,1
   11128: 002c        addi    a1,sp,8
   1112a: 40a5a62f    amoor.w       a2,a0,(a1)
   1112e: 0230000f    fence         r,rw
   11132: 44a5a62f    amoor.w.aq    a2,a0,(a1)
   11136: 0310000f    fence         rw,w
   1113a: 42a5a62f    amoor.w.rl    a2,a0,(a1)
   1113e: 0330000f    fence         rw,rw
   11142: 46a5a62f    amoor.w.aqrl  a2,a0,(a1)
   11146: 46a5a52f    amoor.w.aqrl  a0,a0,(a1)
   1114a: 0141        addi    sp,sp,16
   1114c: 8082        ret

These atomic types and functions become the building blocks for higher-level concurrent types in a small OS.

a simple lock

With atomics and memory barriers available, we can implement the earlier Mutex as a spinlock.

use core::cell::UnsafeCell;
use core::sync::atomic::{AtomicBool, Ordering};

pub struct Mutex<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

impl<T> Mutex<T> {
    pub const fn new(data: T) -> Self {
        Self {
            locked: AtomicBool::new(false),
            data: UnsafeCell::new(data),
        }
    }
}

AtomicBool stores whether the lock is held. This corresponds to the LOCK variable in the assembly spinlock.

The protected T is wrapped in UnsafeCell<T>. This is Rust's primitive for interior mutability. A shared &Mutex<T> cannot ordinarily produce a mutable reference to its contents. UnsafeCell makes that possible, but only unsafely. The rest of the Mutex implementation must therefore enforce the missing rule:

  • A reference to the inner data can only be produced after the lock has been acquired.
  • The lock is released automatically when the access guard is dropped.

The type that represents acquired access is MutexGuard:

pub struct MutexGuard<'a, T> {
    lock: &'a Mutex<T>,
}

lock() returns this guard:

impl<T> Mutex<T> {
    pub fn lock(&self) -> MutexGuard<'_, T> {
        while self.locked.swap(true, Ordering::Acquire) {
            core::hint::spin_loop();
        }
        MutexGuard { lock: self }
    }
}

The lock starts as false. To acquire it, the code atomically swaps in true. If the previous value was already true, another CPU or interrupt-level path holds the lock, so this CPU waits. Ordering::Acquire ensures that operations after the successful acquisition are ordered after the acquisition. core::hint::spin_loop() lets the compiler emit a machine hint when the target has one.

Once the lock is acquired, the guard can provide Deref, DerefMut, and Drop:

use core::ops::{Deref, DerefMut};

impl<T> Drop for MutexGuard<'_, T> {
    fn drop(&mut self) {
        self.lock.locked.store(false, Ordering::Release)
    }
}

impl<T> Deref for MutexGuard<'_, T> {
    type Target = T;

    fn deref(&self) -> &T {
        // Safety: this guard exists only while the lock is held.
        unsafe { &*self.lock.data.get() }
    }
}

impl<T> DerefMut for MutexGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        // Safety: this guard is the unique mutable access path.
        unsafe { &mut *self.lock.data.get() }
    }
}

The Drop implementation releases the lock with Ordering::Release. This ensures that operations performed while the guard existed are ordered before the unlock. Deref and DerefMut are unsafe internally because they dereference the raw pointer inside UnsafeCell, but the public interface is safe because the guard proves that the lock is held and because references derived from the guard cannot outlive the guard.

The type is still incomplete. It needs marker-trait reasoning:

unsafe impl<T: Send> Sync for Mutex<T> {}
unsafe impl<T: Sync> Sync for MutexGuard<'_, T> {}

Send and Sync are central to Rust's concurrency safety, so the next section spells out why these bounds matter.

lock safety

Rust uses the marker traits Send and Sync to express concurrency safety. In this no-std kernel, they should be read as compiler-level facts about values crossing CPU-visible ownership and sharing boundaries, not as API hooks for spawning work.

  • Send means ownership of a value can cross a kernel scheduling or CPU boundary without causing undefined behavior. For example, the value may be stored in process state, a global queue, or another structure that a different hart later runs with or drops.
  • Sync means a shared reference &T can be exposed through static or shared kernel data and used by multiple harts, or by both base-level and interrupt-level paths, without data races.

"Safely" means without causing data races or other undefined behavior.

Most core and alloc Rust types are automatically Send and Sync when their fields are. Some types are intentionally not. Unsafe code may implement these traits manually, but doing so is a promise that the type upholds the corresponding invariant. The examples here are no-std examples: Arc means alloc::sync::Arc, Rc means alloc::rc::Rc, and the cell and atomic types come from core.

Some examples:

  • &T is Send when T is Sync, because copying a shared reference into another CPU-visible path shares access to T.
  • &T is also Sync when T is Sync, because sharing a reference to &T is still only shared access to T.
  • &mut T is Send when T is Send, because the exclusive reference can carry ownership-like access across a scheduling or CPU boundary.
  • &mut T is Sync when T is Sync, because sharing &&mut T only gives shared access to the underlying T.
  • Raw pointers such as *const T, *mut T, and NonNull<T> are not automatically Send or Sync.
  • Rc<T> is neither Send nor Sync, because its reference count is not atomic.
  • AtomicUsize, AtomicBool, and other atomic types are Send and Sync.
  • Arc<T> can be Send and Sync when T satisfies the required bounds, because its reference count is atomic and it can expose shared &T.
  • UnsafeCell<T> is Send when T is Send, but it is not Sync, because it allows mutation through shared references without synchronization.
  • Cell<T> and RefCell<T> follow the same broad pattern: they can be Send when T is Send, but they are not Sync.

These traits let Rust reject many unsafe concurrent programs at compile time. Our spinlock must fit into that model.

The Mutex<T> type contains UnsafeCell<T>:

pub struct Mutex<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

Because UnsafeCell<T> is not Sync, Mutex<T> would not be Sync automatically. That would make it impossible to share &Mutex<T> between CPUs, which is exactly what a lock must support. The lock makes sharing safe, so the implementation provides Sync manually:

unsafe impl<T: Send> Sync for Mutex<T> {}

The bound is T: Send, not T: Sync. A Mutex<T> gives one CPU path at a time mutable access to T. That access may move values out of T and later store or drop them elsewhere, so the protected type must be safe to transfer. It need not be safe to share by ordinary &T without the mutex.

MutexGuard also needs thought:

pub struct MutexGuard<'a, T> {
    lock: &'a Mutex<T>,
}

It is safe for &MutexGuard<T> to be shared only when T: Sync, because Deref can expose &T through the guard. Therefore:

unsafe impl<T: Sync> Sync for MutexGuard<'_, T> {}

Whether MutexGuard may cross a CPU handoff depends on the lock design. For a pure spinlock with no CPU-local unlock state, moving a guard across such a boundary can be acceptable if the implementation explicitly supports it. For the interrupt-disabling kernel lock later in this note, dropping the guard affects CPU-local interrupt state, so the technically correct design is to make that guard !Send. If an implementation omits that marker, the kernel must rely on scheduler invariants and runtime assertions instead; that is a weaker encoding of the same rule.

At this point we have a simple spinlock. It is useful, but not every critical resource should be protected by a spinlock.

where spinlocks belong

A spinlock works well when the critical section is short. It is a poor long-term mutual exclusion primitive. While a CPU waits for a spinlock, it does no useful work. If the lock is held for too long, the whole system can lose performance.

Contention also matters. If many CPUs compete for one spinlock, performance can collapse. Three broad techniques help:

  1. Use the right lock granularity.
  2. Disable local interrupts while holding spinlocks that may also be used by interrupt handlers.
  3. Do not sleep or context-switch while holding arbitrary spinlocks. A scheduler may deliberately switch while holding a process lock, but that lock must be a special handoff lock with its own invariant.

Lock granularity is the decision of how many locks exist and how much data each lock protects. A coarse-grained design uses a small number of locks to protect large structures. A fine-grained design uses more locks, each protecting a smaller structure.

Coarse locking is simpler:

struct SharedData {
    data1: i32,
    data2: i32,
}

let data = Arc::new(Mutex::new(SharedData { data1: 0, data2: 0 }));
{
    let mut data = data.lock();
    data.data1 += 1;
    data.data2 += 1;
}

Here, both fields are protected by one lock. Any CPU that wants either field must wait while both updates run.

Fine locking can allow more parallelism:

struct SharedData {
    data1: Mutex<i32>,
    data2: Mutex<i32>,
}

let data = Arc::new(SharedData {
    data1: Mutex::new(0),
    data2: Mutex::new(0),
});

{
    let mut data1 = data.data1.lock();
    *data1 += 1;
}
{
    let mut data2 = data.data2.lock();
    *data2 += 1;
}

Now an operation on data1 does not automatically block an unrelated operation on data2. Finer granularity is not always better, though. It increases complexity and lock overhead. The goal is not to scatter locks everywhere. The goal is to avoid contention between unrelated work.

Interrupts are the next problem. Suppose a CPU holds a spinlock and is interrupted before it releases that lock. Other CPUs waiting for the same lock may spin for longer than necessary.

CPU0 base code holds L interrupt handler interrupt CPU1 wait for L CPU2 wait for L

A worse case is deadlock. Deadlock occurs when CPU paths wait forever because each needs a resource held by another path.

Interrupt-level deadlock can occur when a CPU already holds a spinlock, then an interrupt handler on the same CPU tries to acquire the same lock. The interrupted base-level code cannot resume to release the lock because the interrupt handler is spinning.

one CPU base code holds L handler wants L The holder cannot run. The waiter is the interrupt handler. The CPU spins forever.

For locks that can be taken both by base-level code and interrupt handlers on the same CPU, the usual rule is: disable local interrupts while the lock is held.

Disabling interrupts is not enough if the kernel can context-switch while a spinlock is held. If the lock holder sleeps or switches out, other CPUs can spin until that holder is scheduled again. If the waiters have interrupts disabled while spinning, the system can reach a deadlock because the lock holder never gets the CPU time needed to release the lock.

P1 P2 P3 holds lock L switched out spins for L runs instead P1 must run again to release L.

A harder deadlock shape appears when the lock holder is asleep or otherwise not scheduled, while every CPU that could run useful work is spinning for that same lock.

P1 sleeping, holds L CPU1 P2 waits L CPU2 P3 waits L P4 P5 The CPUs spin, so the sleeping holder never gets scheduled to release L.

The safe kernel rule is stronger than "release locks during context switch." Ordinary locks must be released before sleeping or entering a path that may block. octox follows the xv6-style exception for the process lock: sched() switches while holding exactly that process lock, asserts that it is held, and asserts that the interrupt-lock nesting depth is 1. That makes the process lock a scheduler handoff lock, not a general permission to switch while holding arbitrary spinlocks.

The classic AB-BA deadlock is another case. Suppose one path acquires lock A and then lock B, while another path acquires lock B and then lock A.

CPU1 let a = A.lock(); ... let b = B.lock(); CPU2 let b = B.lock(); ... let a = A.lock(); A then B on one CPU, B then A on another.

Whether it happens depends on timing, but the danger is always present. The usual fix is a global lock order. If two locks may be held together, all paths acquire them in the same order.

CPU1 let a = A.lock(); ... let b = B.lock(); CPU2 let a = A.lock(); ... let b = B.lock(); Both paths acquire A before B.

Recursive locking is another deadlock source. A non-recursive spinlock deadlocks if the current CPU path tries to acquire a lock it already holds. Kernel spinlocks are usually intentionally non-recursive because recursive locks can hide design errors and make lock-order reasoning harder.

Some of these problems can be addressed with Rust types. A scheduler switch function can require the relevant lock state in its signature. Interrupt disabling can be built into the lock guard. Recursive acquisition can be detected by recording the CPU that owns the lock. The next section improves the lock in that direction.

a practical kernel lock

The simple spinlock should disable local interrupts while the lock is held, at least for locks that may be used from interrupt context.

The desired interface is guard-based. Calling lock_mycpu() disables interrupts on the current CPU, records the nesting depth, and returns a guard. Dropping the guard decrements the depth, and if the depth becomes zero, restores the previous interrupt state.

Because interrupt state is per-CPU, the nesting depth and previous interrupt-enabled state are per-CPU too:

pub static CPUS: Cpus = Cpus::new();

pub struct Cpus([UnsafeCell<Cpu>; NCPU]);

pub struct Cpu {
    pub noff: isize,
    pub intena: bool,
}

Cpus wraps each Cpu in UnsafeCell because the global CPUS value is static but the per-CPU state changes at runtime. Per-CPU state is not shared with other CPUs, but it is only safe to access when local interrupts are disabled. Otherwise an interrupt could switch execution paths on the same CPU and corrupt the state.

noff counts the interrupt-disable nesting depth. intena records whether interrupts were enabled before the outermost interrupt lock was taken.

The guard can be empty:

pub struct IntrLock;

IntrLock means "local interrupts are disabled for this CPU." If all IntrLock guards on that CPU have been dropped, interrupts may be restored to their previous state.

impl Cpus {
    pub unsafe fn cpu_id() -> usize {
        let id;
        asm!("mv {0}, tp", out(reg) id);
        id
    }

    pub unsafe fn mycpu(&self) -> *mut Cpu {
        let id = Self::cpu_id();
        self.0[id].get()
    }

    pub fn lock_mycpu() -> IntrLock {
        let old = intr_get();
        intr_off();
        unsafe { (&mut *CPUS.mycpu()).locked(old) }
    }
}

cpu_id() reads the current hart or CPU ID from tp. It is unsafe because the CPU-local pointer derived from it is only protected while local interrupts are disabled. With interrupts enabled, a trap or scheduler path can run before the caller finishes using that pointer, and the invariant protecting the per-CPU slot no longer holds. mycpu() returns a raw pointer for the same reason: the caller must maintain the interrupt-disabled precondition while using it.

lock_mycpu() records the previous interrupt state, disables interrupts, and then updates the current CPU's nesting state.

impl Cpu {
    const fn new() -> Self {
        Self {
            noff: 0,
            intena: false,
        }
    }

    fn locked(&mut self, old: bool) -> IntrLock {
        if self.noff == 0 {
            self.intena = old;
        }
        self.noff += 1;
        IntrLock
    }

    pub fn unlock(&mut self) {
        assert!(!intr_get(), "core unlock - interruptible");
        assert!(self.noff >= 1, "unlock");
        self.noff -= 1;
        if self.noff == 0 && self.intena {
            intr_on()
        }
    }
}

impl Drop for IntrLock {
    fn drop(&mut self) {
        unsafe { (&mut *CPUS.mycpu()).unlock() }
    }
}

unlock() asserts that interrupts are still disabled. It decrements noff, and if this was the outermost interrupt lock and interrupts had been enabled before it was acquired, it re-enables them.

Now combine IntrLock with MutexGuard:

use core::marker::PhantomData;

pub struct Mutex<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

pub struct MutexGuard<'a, T> {
    lock: &'a Mutex<T>,
    _intr_lock: IntrLock,
    _not_send: PhantomData<*mut ()>,
}

impl<T> Mutex<T> {
    pub fn lock(&self) -> MutexGuard<'_, T> {
        let intr_lock = Cpus::lock_mycpu();
        while self.locked.swap(true, Ordering::Acquire) {
            core::hint::spin_loop();
        }
        MutexGuard {
            lock: self,
            _intr_lock: intr_lock,
            _not_send: PhantomData,
        }
    }
}

impl<T> Drop for MutexGuard<'_, T> {
    fn drop(&mut self) {
        self.lock.locked.store(false, Ordering::Release)
    }
}

The order matters. Interrupts must be disabled before attempting to acquire the spinlock. Otherwise an interrupt could occur after the lock is acquired but before interrupts are disabled.

Unlock order matters too. The spinlock must be released while interrupts are still disabled. In this shape, MutexGuard::drop() first stores false into the lock. After the drop() body returns, Rust drops the fields, including _intr_lock, which may restore interrupts. That is the desired order: release the spinlock, then restore interrupts.

A lock that disables local interrupts is tied to CPU-local state. The guard must not be dropped on a different CPU from the one that acquired it. The PhantomData<*mut ()> field is a no-std way to make the guard !Send while still allowing a deliberate unsafe impl<T: Sync> Sync if shared references to the guard are valid for T. The current octox implementation does not encode this marker; it relies on kernel paths that carry guards across swtch being constrained. sched() requires the current process lock to be held, requires it to be the only spinlock-derived interrupt lock (noff == 1), and restores the saved interrupt-enable state after returning from the scheduler. That is a kernel invariant, not a property provided by ordinary Rust ownership alone.

We can also detect recursive locking by recording the owning CPU:

use core::ptr;
use core::sync::atomic::{AtomicPtr, Ordering};

pub struct Mutex<T> {
    name: &'static str,
    locked: AtomicPtr<Cpu>,
    data: UnsafeCell<T>,
}

impl<T> Mutex<T> {
    pub const fn new(data: T, name: &'static str) -> Self {
        Self {
            name,
            locked: AtomicPtr::new(ptr::null_mut()),
            data: UnsafeCell::new(data),
        }
    }
}

Instead of storing a Boolean, locked stores a pointer to the owning Cpu, or null if the lock is free. This makes it possible to ask whether the current CPU already holds the lock.

impl<T> Mutex<T> {
    unsafe fn holding(&self) -> bool {
        self.locked.load(Ordering::Relaxed) == CPUS.mycpu()
    }

    pub fn lock(&self) -> MutexGuard<'_, T> {
        let intr_lock = Cpus::lock_mycpu();

        unsafe {
            assert!(!self.holding(), "acquire {}", self.name);
            while self
                .locked
                .compare_exchange(
                    ptr::null_mut(),
                    CPUS.mycpu(),
                    Ordering::Acquire,
                    Ordering::Relaxed,
                )
                .is_err()
            {
                core::hint::spin_loop();
            }
        }

        MutexGuard {
            lock: self,
            _intr_lock: intr_lock,
            _not_send: PhantomData,
        }
    }
}

impl<T> Drop for MutexGuard<'_, T> {
    fn drop(&mut self) {
        unsafe {
            assert!(!intr_get(), "interrupts enabled");
            assert!(self.lock.holding(), "release {}", self.lock.name);
        }
        self.lock.locked.store(ptr::null_mut(), Ordering::Release)
    }
}

holding() compares the recorded owner with the current CPU pointer. It is unsafe because CPUS.mycpu() is only valid under the interrupt-disabled invariant.

The recursive-lock assertion runs after interrupts are disabled, so the current CPU pointer is stable under the kernel invariant. The acquisition uses compare_exchange: if locked is null, store the current CPU pointer; otherwise keep spinning. Ordering::Acquire is used on success. The failure ordering is Relaxed because a failed attempt does not enter the critical section.

Unlocking asserts that interrupts are disabled and that the current CPU owns the lock, then stores null with Ordering::Release.

Sometimes OS code needs to drop a lock and reacquire it around a scheduler boundary. A convenience method can consume the guard and return the mutex reference:

impl<T> Mutex<T> {
    pub fn unlock(guard: MutexGuard<'_, T>) -> &Mutex<T> {
        guard.lock
    }

    pub unsafe fn force_unlock(&self) {
        assert!(self.holding(), "force unlock {}", self.name);
        self.locked.store(ptr::null_mut(), Ordering::Release);
        (&mut *CPUS.mycpu()).unlock()
    }
}

unlock() consumes the guard. force_unlock() is unsafe because it bypasses the normal guard lifetime and must only be used at carefully audited scheduler boundaries. A kernel implementation should document exactly which invariant makes the call safe.

In octox, sleep(chan, guard) uses this shape: acquire the current process lock, consume and drop the original guard, mark the process sleeping, call sched() with the process lock, and reacquire the original lock after wakeup. fork_ret() is the exceptional path that calls force_unlock() because the first return into a forked process still inherits the process lock from the scheduler handoff, and passing the original guard through that stack transition is awkward.

This gives us the core spinlock design: atomic acquisition and release, acquire/release ordering, local interrupt discipline, owner tracking, and recursive-lock detection.

OnceLock

A static variable sometimes needs to be initialized at runtime and then never changed. A Mutex could protect that initialization, but the intended operation is simpler: write exactly once, then read forever. That is the role of OnceLock.

Example:

pub static INITPROC: OnceLock<Arc<Proc>> = OnceLock::new();

INITPROC.set(p.clone()).unwrap();
assert!(!Arc::ptr_eq(&p, INITPROC.get().unwrap()), "init exiting");

INITPROC is initialized once at runtime. set(&self, value: T) attempts to set the value and returns a Result. get() returns Some(&T) if initialization is complete, and None otherwise.

OnceLock is a concurrent type. It uses the same atomic and guard reasoning as a spinlock. The simplified implementation below does not include interrupt or scheduler integration. That is acceptable for early initialization paths, but it is not a complete blocking primitive for every kernel context.

use core::cell::UnsafeCell;
use core::mem::MaybeUninit;
use core::sync::atomic::{AtomicUsize, Ordering};

const INCOMPLETE: usize = 0x0;
const POISONED: usize = 0x1;
const RUNNING: usize = 0x2;
const COMPLETE: usize = 0x3;

pub struct OnceLock<T> {
    state: AtomicUsize,
    inner: UnsafeCell<MaybeUninit<T>>,
}

struct OnceLockGuard<'a, T> {
    oncecell: &'a OnceLock<T>,
    poison: bool,
}

impl<T> OnceLock<T> {
    pub const fn new() -> Self {
        Self {
            state: AtomicUsize::new(INCOMPLETE),
            inner: UnsafeCell::new(MaybeUninit::uninit()),
        }
    }
}

INCOMPLETE, POISONED, RUNNING, and COMPLETE encode the initialization state. MaybeUninit<T> holds memory that may not yet contain a valid T. UnsafeCell provides the interior mutability needed to write into that memory through &self.

Core accessors:

impl<T> OnceLock<T> {
    fn is_initialized(&self) -> bool {
        self.state.load(Ordering::Acquire) == COMPLETE
    }

    unsafe fn get_unchecked(&self) -> &T {
        (*self.inner.get()).assume_init_ref()
    }

    pub fn set(&self, value: T) -> Result<(), T> {
        let mut value = Some(value);
        self.get_or_init(|| value.take().unwrap());
        match value {
            None => Ok(()),
            Some(value) => Err(value),
        }
    }

    pub fn get_or_init(&self, func: impl FnOnce() -> T) -> &T {
        match self.get_or_try_init(func) {
            Ok(res) => res,
            Err(_) => loop {
                if self.state.load(Ordering::Acquire) == COMPLETE {
                    break unsafe { self.get_unchecked() };
                }
                core::hint::spin_loop()
            },
        }
    }
}

is_initialized() uses Acquire so reads after seeing COMPLETE are ordered after the initialization release. get_unchecked() is only safe after the value is initialized.

get_or_init() returns the initialized value. If another CPU is already initializing the cell, this simplified version spins until the state becomes COMPLETE.

The initialization path:

impl<T> OnceLock<T> {
    fn try_get(&self) -> Result<&T, usize> {
        match self.state.load(Ordering::Acquire) {
            POISONED => panic!("poisoned"),
            COMPLETE => Ok(unsafe { self.get_unchecked() }),
            INCOMPLETE => Err(INCOMPLETE),
            RUNNING => Err(RUNNING),
            _ => unreachable!(),
        }
    }

    pub fn get_or_try_init(&self, func: impl FnOnce() -> T) -> Result<&T, ()> {
        match self.try_get() {
            Ok(res) => Ok(res),
            Err(RUNNING) => Err(()),
            Err(INCOMPLETE) => {
                let mut func = Some(func);
                let res = self.try_init_inner(&mut || func.take().unwrap()())?;
                Ok(res)
            }
            Err(_) => unreachable!(),
        }
    }

    #[cold]
    fn try_init_inner(&self, func: &mut dyn FnMut() -> T) -> Result<&T, ()> {
        unsafe {
            let mut guard = self.try_block(Ordering::Acquire)?;
            let inner = &mut *self.inner.get();
            inner.as_mut_ptr().write(func());
            guard.poison = false;
        }
        Ok(unsafe { self.get_unchecked() })
    }

    fn try_block(&self, order: Ordering) -> Result<OnceLockGuard<'_, T>, ()> {
        match self
            .state
            .compare_exchange(INCOMPLETE, RUNNING, order, Ordering::Acquire)
        {
            Ok(INCOMPLETE) => Ok(OnceLockGuard {
                oncecell: self,
                poison: true,
            }),
            _ => Err(()),
        }
    }

    fn unblock(&self, state: usize, order: Ordering) {
        self.state.swap(state, order);
    }
}

impl<T> Drop for OnceLockGuard<'_, T> {
    fn drop(&mut self) {
        let state = if self.poison { POISONED } else { COMPLETE };
        self.oncecell.unblock(state, Ordering::AcqRel);
    }
}

try_get() reads the state. COMPLETE returns the initialized value. RUNNING means another CPU path is initializing. POISONED means a previous initialization panicked before completing.

get_or_try_init() avoids monomorphizing the cold initialization body by wrapping the FnOnce in an Option and passing a FnMut closure to try_init_inner().

try_block() is the lock step: it changes INCOMPLETE to RUNNING with compare_exchange. The returned OnceLockGuard is responsible for marking the cell COMPLETE or POISONED. octox uses an atomic swap with Ordering::AcqRel when the guard leaves scope. The release half is the essential publication step: the write to the initialized value happens before readers observe COMPLETE with Acquire. The acquire half is conservative for this read-modify-write unblock operation.

OnceLock also needs to drop the initialized value:

impl<T> Drop for OnceLock<T> {
    fn drop(&mut self) {
        if self.is_initialized() {
            unsafe { (*self.inner.get()).assume_init_drop() }
        }
    }
}

assume_init_drop() is safe only when the value is initialized, so the state check is the guard.

The marker trait implementations are:

unsafe impl<T: Send> Send for OnceLock<T> {}
unsafe impl<T: Sync + Send> Sync for OnceLock<T> {}

OnceLock<T> owns a T, so it can be sent only when T: Send. Shared access can expose &T, and a value can be installed through a shared &OnceLock<T>, so Sync requires both T: Sync and T: Send.

The remaining utility methods are:

impl<T> OnceLock<T> {
    pub fn get(&self) -> Option<&T> {
        if self.is_initialized() {
            Some(unsafe { self.get_unchecked() })
        } else {
            None
        }
    }

    pub fn get_mut(&mut self) -> Option<&mut T> {
        if self.is_initialized() {
            Some(unsafe { self.get_unchecked_mut() })
        } else {
            None
        }
    }

    unsafe fn get_unchecked_mut(&mut self) -> &mut T {
        (*self.inner.get()).assume_init_mut()
    }

    pub fn take(&mut self) -> Option<T> {
        if self.is_initialized() {
            self.state = AtomicUsize::new(INCOMPLETE);
            unsafe { Some((*self.inner.get()).assume_init_read()) }
        } else {
            None
        }
    }

    pub fn into_inner(mut self) -> Option<T> {
        self.take()
    }
}

get_mut(&mut self) is safe because &mut self proves exclusive access to the OnceLock. A mutable static is different: reading or writing a static mut OnceLock<T> still requires unsafe reasoning about who can access it.

For example:

pub static mut KVM: OnceLock<Kvm> = OnceLock::new();

// Initialize the one kernel page table.
//
// Safety: this function must run while only one CPU can access KVM.
pub unsafe fn kinit() {
    unsafe {
        KVM.set(Kvm::new().unwrap()).unwrap();
        KVM.get_mut().unwrap().make();
    }
}

kinit() is safe only if it runs during a single-CPU initialization phase, or under another synchronization rule that makes access exclusive. The unsafe fn states the precondition; the internal unsafe block states where the implementation relies on it.

LazyLock

LazyLock initializes a value the first time it is accessed. It is implemented using OnceLock. When the use case fits, LazyLock is often more readable than calling OnceLock directly.

Example:

pub static PROCS: LazyLock<Procs> = LazyLock::new(|| Procs::new());

pub fn init() {
    for (i, proc) in PROCS.pool.iter().enumerate() {
        proc.data_mut().kstack = kstack(i);
    }
}

The first access to PROCS initializes it. Here that first access occurs through Deref when PROCS.pool.iter() is evaluated.

A minimal implementation looks like this:

use core::cell::Cell;
use core::ops::Deref;

pub struct LazyLock<T, F = fn() -> T> {
    cell: OnceLock<T>,
    init: Cell<Option<F>>,
}

impl<T, F> LazyLock<T, F> {
    pub const fn new(init: F) -> Self {
        Self {
            cell: OnceLock::new(),
            init: Cell::new(Some(init)),
        }
    }
}

impl<T, F: FnOnce() -> T> LazyLock<T, F> {
    pub fn force(this: &LazyLock<T, F>) -> &T {
        this.cell.get_or_init(|| match this.init.take() {
            Some(f) => f(),
            None => panic!("Lazy instance has previously been poisoned"),
        })
    }
}

impl<T, F: FnOnce() -> T> Deref for LazyLock<T, F> {
    type Target = T;

    fn deref(&self) -> &Self::Target {
        LazyLock::force(self)
    }
}

F = fn() -> T is the default type parameter. cell stores the initialized value. init stores the initialization function in a Cell<Option<F>>.

Cell is not Sync, but init is only touched by the closure passed into OnceLock::get_or_init(). That closure runs only for the CPU path that wins initialization, so access to init is serialized by OnceLock.

Cell::take() replaces the Option<F> with None and returns the old value. On the normal path, this happens once. If initialization has already completed, get_or_init() returns the stored value and never calls the closure again.

Deref calls force(), so ordinary access initializes the value lazily.

The marker trait implementation is:

unsafe impl<T: Sync + Send, F: Send> Sync for LazyLock<T, F> {}

LazyLock contains Cell, so it is not automatically Sync. The implementation is safe when OnceLock<T> is sync-safe, which requires T: Sync + Send, and when the initializer can be moved into the CPU path that runs initialization, which requires F: Send. F does not need to be Sync, because the initializer itself is not concurrently shared.

This gives a small OS the basic concurrency-control types needed to move toward a multicore design: atomics, spinlocks, interrupt-aware guards, one-time initialization, and lazy initialization.