vmm.dev

system output

system ou...system outputLinuxLinuxlinux tro...linux troubleshootingkernel patchkernel patchvirtualiz...virtualized timeシステム出力システム出力vmm.devvmm.dev

system output

Output surfaces for Linux guests

Related: Linux, linux troubleshooting, kernel patch, virtualized time, system output ja.

System output is not just logs. For Linux and guest OS work on a virtualized platform, useful evidence comes from several surfaces with different lifetimes:

  • live kernel messages,
  • tracing streams,
  • BPF maps and ring buffers,
  • crash dumps,
  • RAM-backed persistent records,
  • host or hypervisor traces,
  • raw memory left after failure.

The goal is not to collect everything. The goal is to preserve the evidence that answers one question: which layer owned the failed resource when the system crossed from healthy to broken?

A logging-only view is too small for systems work. A machine can reboot before userspace starts. A guest can blame a driver while the real event is a missed virtual interrupt. RAM can contain the only surviving record. This note is the broader map I want when debugging those systems.

guest kernel Linux guest / vendor kernel printk pr_* / dev_* ring buffer /dev/kmsg dmesg live kernel log tracefs ftrace events BPF maps / ringbuf perf counts / samples postmortem RAM records that survive long enough to read after failure pstore ramoops vmcore kdump / crash dump RAM image guest userspace journal stdio core dump host / hypervisor do not merge timelines too early vCPU IRQ MMIO reset choose the surface by lifetime: before crash, during crash, after reset, or outside the guest.

The evidence model

For this note, I split output into four classes.

ClassExamplesWhen it helps
Live logsdmesg -w, journalctl -f, serial console capturethe system stays alive long enough to stream
Live tracingftrace, perf, bpftrace, BPF CO-RE toolsthe symptom is reproducible and timing matters
Persistent RAM recordspstore, ramoops, ftrace-ramoops, console-ramoopsthe system reboots, panics, or loses userspace
External evidenceserial console, hypervisor trace, monitor dump, RAM dumpthe guest is not trustworthy or is already dead

The classes have different failure modes. A journal can be late. A userspace buffer can be circular. A trace buffer can perturb timing. A ramoops region can be overwritten by the next crash. A host trace can describe a vCPU that the guest thought was running. Keep the surface name with the evidence.

I use this escalation path:

  1. Event count: count how often something happens.
  2. Profiling: sample where time is spent.
  3. Tracing: record timestamped events and reconstruct ordering.
  4. Postmortem: inspect records left in RAM or a dump after the live system is gone.

Do not start with the most powerful tool. Start with the cheapest surface that can answer the current hypothesis.

Kernel messages: printk, dmesg, and kmsg

Kernel messages are still the first source for boot, driver, and panic work. Kernel code commonly emits messages through printk, pr_*, and dev_* helpers:

pr_info("driver init\n");
pr_warn("retrying register read\n");
dev_err(dev, "probe failed: %d\n", ret);

For device drivers, prefer dev_* where possible. It attaches device context, which matters when the same driver instance exists in several guests or several device-tree nodes.

The path is:

kernel call site
  -> printk record
  -> kernel log buffer
  -> /dev/kmsg
  -> dmesg, journald, log plumbing, or a collector

Useful commands:

dmesg --time-format=iso --decode
dmesg --level=err,warn
dmesg --follow
journalctl -k -b --no-pager
cat /proc/sys/kernel/printk
sysctl kernel.printk

console_loglevel controls which messages reach the console. That is separate from whether the record exists in the kernel buffer. During boot-loop work, console visibility is a debugging interface, not just a cosmetic setting.

cat /proc/sys/kernel/printk
# console_loglevel default_message_loglevel minimum_console_loglevel default_console_loglevel

dmesg -s or --buffer-size still matters when reading older notes and compatibility paths:

dmesg --buffer-size 262144

On modern systems, /dev/kmsg is usually the important interface, but the operational rule is older than the implementation detail: a noisy failure can overwrite the earlier message that explained the cause. Preserve complete output before adding more instrumentation.

RAM-backed records: pstore and ramoops

RAM is often the best storage medium left after the crash. pstore is the kernel interface for persistent storage of crash records. ramoops is a pstore backend that writes records into a reserved RAM area.

The important property is lifetime. A normal log needs userspace and storage. A ramoops record needs a reserved memory region that survives reset well enough for the next boot to read it.

Typical records appear under:

mount -t pstore pstore /sys/fs/pstore 2>/dev/null || true
ls -l /sys/fs/pstore
cat /sys/fs/pstore/console-ramoops-0
cat /sys/fs/pstore/dmesg-ramoops-0
cat /sys/fs/pstore/ftrace-ramoops-0

In a device tree, a reserved ramoops region looks conceptually like this:

reserved-memory {
    #address-cells = <2>;
    #size-cells = <2>;
    ranges;

    ramoops@8f000000 {
        compatible = "ramoops";
        reg = <0 0x8f000000 0 0x100000>;
        record-size = <0x4000>;
        console-size = <0x4000>;
        ftrace-size = <0x4000>;
        pmsg-size = <0x4000>;
    };
};

For platform bring-up or image qualification, I check:

cat /proc/iomem | grep -iE 'ramoops|pstore|reserved'
cat /proc/cmdline | tr ' ' '\n' | grep -iE 'ramoops|pstore|memmap'
zcat /proc/config.gz 2>/dev/null | grep -E 'PSTORE|RAMOOPS'
grep -R . /sys/module/ramoops/parameters 2>/dev/null

If pstore is empty, do not assume the system did not panic. Check the reserved address, overlap with firmware carveouts, whether the reset preserved the memory, whether ECC or boot firmware scrubbed RAM, and whether another crash already overwrote the record.

For shutdown or late panic debugging, ftrace can also be recorded to pstore:

mount -t debugfs debugfs /sys/kernel/debug 2>/dev/null || true
echo 1 > /sys/kernel/debug/pstore/record_ftrace
reboot -f
tail /sys/fs/pstore/ftrace-ramoops-0

That is valuable when the final live console only says "rebooting" but the RAM record still has the function path.

Crash dumps and raw RAM

pstore is small. For memory corruption, page allocator bugs, use-after-free, or a panic after large state buildup, you often need a larger dump.

Common Linux paths:

cat /proc/vmcore              # only inside the capture kernel
crash vmlinux vmcore
makedumpfile -d 31 /proc/vmcore vmcore.filtered

Guest and hypervisor paths vary:

  • firmware or monitor RAM dump,
  • hypervisor guest physical memory dump,
  • offline memory capture,
  • crash kernel with reserved crashkernel=,
  • platform minidump framework,
  • core dump plus pstore plus platform RAM dump.

When taking a raw RAM dump, record the metadata next to the file:

machine / SKU / revision
kernel image build id
vmlinux or unstripped symbols
System.map and /proc/kallsyms if available
guest physical address layout
reserved memory map
hypervisor version
reset reason
clock source and boot id

Without address layout and symbols, a RAM image is just bytes. With them, it can answer concrete questions: which task was running, which lock was held, which interrupt was pending, which page was poisoned, and whether a buffer still contains the record that userspace never flushed.

Useful live memory state before a crash:

cat /proc/iomem
cat /proc/meminfo
cat /proc/vmallocinfo | head
cat /proc/slabinfo | head
cat /proc/buddyinfo
cat /proc/pagetypeinfo 2>/dev/null | head
cat /proc/zoneinfo | head

Useful debug features when you can build the kernel:

CONFIG_KASAN
CONFIG_KFENCE
CONFIG_KCSAN
CONFIG_DEBUG_PAGEALLOC
CONFIG_PAGE_OWNER
CONFIG_DEBUG_VM
CONFIG_SLUB_DEBUG
CONFIG_PSTORE
CONFIG_PSTORE_RAM
CONFIG_PSTORE_FTRACE
CONFIG_KEXEC_CORE
CONFIG_CRASH_DUMP

These are not free. On constrained targets they change timing, memory pressure, and sometimes cache behavior. Use them intentionally and record that they were enabled.

GDB for cores and live state

GDB is useful when the evidence is a core dump, a live process, a reproduced crash under controlled conditions, or a small piece of machine state extracted from a larger failure. It is not a replacement for pstore, ftrace, BPF, or a vmcore reader. Use it when you have symbols, registers, mappings, and a concrete question about control flow or data state.

The capture rule is simple: keep the executable, shared libraries, debug files, build ID, core file, and the command line together. A backtrace without matching symbols is often just a timestamped guess.

Opening a core

gdb -q /path/to/program /path/to/core

Inside GDB:

set pagination off
set print pretty on
set print frame-arguments all
set disassemble-next-line on
info files
info proc mappings
info sharedlibrary
info threads
thread apply all bt full

If symbols live outside the target root:

set sysroot /path/to/sysroot
set solib-search-path /path/to/sysroot/lib:/path/to/sysroot/usr/lib
set debug-file-directory /path/to/debug
sharedlibrary

For reproducible local runs, keep the run environment explicit:

set args --config /tmp/repro.conf
set environment RUST_BACKTRACE=1
run

Commands I use constantly

CommandUse
btshort backtrace
bt fullbacktrace with locals
thread apply all bt fullall thread stacks, usually the first command on a core
info registersregister state at the selected frame
info framesaved registers, frame address, caller information
frame Nselect a stack frame
up / downmove through frames
listshow source near the current line
info localslocal variables
info argsfunction arguments
ptype exprprint type layout
p exprprint expression
p/x exprprint expression in hex
x/16gx addrexamine 16 eight-byte words
x/32xb addrexamine 32 bytes
disassemble /m funcmixed source and assembly
info symbol addrmap an address to a symbol
maintenance info sectionsloaded sections and ranges

Crash triage pattern

For a native crash, I usually start with:

set pagination off
thread apply all bt full
info registers
info proc mappings
info files

Then select the crashing frame and check arguments and nearby memory:

thread 1
frame 0
info args
info locals
p/x $pc
x/8i $pc-16
x/32gx $sp

If the program counter is inside a shared object but symbols are missing, use the mappings:

info proc mappings
info symbol 0x7f1234567890

For a stripped binary, the build ID and mapping base are more useful than guessing. Find the mapped file and load matching symbols:

add-symbol-file /path/to/libfoo.so.debug 0x7f1234500000

Data corruption pattern

When the symptom is corrupted data, the question is often whether the value is invalid, stale, freed, or written by the wrong owner.

p object
ptype object
p/x object->state
x/64gx object
info symbol object->callback

For lists and queues, inspect both the container and the pointed-to elements. A valid-looking pointer can still point into the wrong lifetime:

p/x node
x/4gx node
info proc mappings

If the address is outside any mapping, the core and binary are not enough. Move to pstore, vmcore, allocator debugging, page owner, KASAN, KFENCE, or a RAM image.

Live attach pattern

Attach only when the pause is acceptable:

gdb -q -p "$(pidof target-service)"

Inside GDB:

set pagination off
info threads
thread apply all bt
continue

For short probes, prefer breakpoint commands that collect and continue:

break suspicious_function
commands
  silent
  printf "hit suspicious_function arg=%p\n", arg
  bt 4
  continue
end

If attach changes timing too much, stop using GDB and move to ftrace, perf, or BPF.

Kernel-adjacent use

For a kernel vmcore, use crash or a kernel-aware debugger first. GDB can inspect raw symbols and disassembly, but it does not understand many kernel data structures by itself.

Useful kernel-adjacent GDB cases still exist:

  • decoding a userspace core that was produced by a system failure,
  • checking a small repro program,
  • validating a driver helper compiled as ordinary code,
  • inspecting an address or disassembly from a trace,
  • attaching through kgdb when the kernel is configured for it.

For kgdb-style sessions, keep the symbol file exact:

gdb -q vmlinux
target remote /dev/ttyS0
info registers
bt
lx-symbols

The lx-* commands depend on the kernel's GDB scripts and the debugging environment. If they are unavailable, do not pretend the view is complete. Fall back to crash, vmcore analysis, ftrace, BPF, and RAM-backed records.

Guest userspace output surfaces

A Linux guest can have a full userspace logging stack, a minimal initramfs, or almost no userspace at all. Do not treat userspace logs as a replacement for kernel evidence.

The main userspace surfaces are:

SurfaceWhat it gives
service logsrecords emitted by init systems, daemons, and applications
dmesgkernel ring buffer when accessible
/sys/fs/pstorekernel crash, console, ftrace, pmsg records after reboot
core dumpsnative crash state with registers, stacks, mappings, and fds
support bundlepackaged state across services and system interfaces
service introspectionservice-specific state snapshots

Basic capture:

dmesg --time-format=iso --decode > dmesg.txt
journalctl -b --no-pager > journal.txt
journalctl -k -b --no-pager > journal-kernel.txt
ls -l /sys/fs/pstore
cp -a /sys/fs/pstore ./pstore
coredumpctl list 2>/dev/null

Userspace logs can be structured, buffered, filtered, rate-limited, and delayed. A native crash may produce a core dump or a crash handler record only if the relevant handler and storage path were alive at the time of failure.

For platform work, read crash records like state snapshots, not like ordinary logs. A useful record includes registers, stack traces, memory maps, open file descriptors, and the exact binary build ID. If you have matching unstripped symbols, symbolize the stack instead of guessing from raw program counters.

On a hypervisor, userspace symptoms add one more split:

application or service symptom
  -> native service, IPC, or syscall path
  -> guest kernel driver / irq / scheduler
  -> virtual device model or hypervisor event
  -> physical device / firmware

If a service is blocked, userspace logs may show the symptom and nothing about the lower layer. Check IPC state, scheduler traces, guest interrupts, and host-side virtual device events.

Event count, profiling, tracing

A useful progression is:

  1. Event counting tells you whether a class of event is abnormal.
  2. Profiling tells you where time is being spent.
  3. Tracing tells you the order of events.

That progression avoids a common failure: recording a huge trace before you know which signal matters.

Event counting examples:

perf stat -e task-clock,context-switches,cpu-migrations,page-faults -a -- sleep 10
perf stat -e cycles,instructions,branches,branch-misses -a -- sleep 10
perf stat -e cache-references,cache-misses -a -- sleep 10

Profiling examples:

perf record -F 99 -a -g -- sleep 10
perf report
perf script > out.perf

Tracing examples:

trace-cmd record -e sched:sched_switch -e irq:* -e workqueue:* sleep 5
trace-cmd report

For a guest OS, run the count on both sides if possible. A guest can show low CPU usage while the host shows the vCPU is not scheduled. A guest can show block I/O timeout while the host shows a backend queue stall.

ftrace and tracefs

ftrace is the built-in kernel tracing framework exposed through tracefs, usually mounted at /sys/kernel/tracing.

Start by checking the surface:

mount -t tracefs nodev /sys/kernel/tracing 2>/dev/null || true
cd /sys/kernel/tracing
cat available_tracers
cat available_events | head
cat tracing_on

Function graph tracing is useful when you know the function family:

echo 0 > tracing_on
echo function_graph > current_tracer
echo '*read*' > set_ftrace_filter
echo 1 > tracing_on
cat /path/to/file >/dev/null
echo 0 > tracing_on
cat trace > /tmp/read.trace

Event tracing is safer and usually more stable:

echo 0 > tracing_on
echo > trace
echo 1 > events/sched/sched_switch/enable
echo 1 > events/irq/irq_handler_entry/enable
echo 1 > events/irq/irq_handler_exit/enable
echo 1 > tracing_on
sleep 3
echo 0 > tracing_on
cat trace > /tmp/events.trace

For boot problems, boot-time tracing can start earlier than a manual shell session. For latency problems, kernel 6.x tracing documentation includes tracers such as osnoise, timerlat, and hwlat. Those are especially relevant on systems where missed latency budgets are failures, not just performance noise.

A good low-level trace includes:

  • scheduler switches,
  • IRQ entry and exit,
  • softirq,
  • workqueue execution,
  • power and clock events,
  • block or network tracepoints if the symptom is I/O,
  • driver-specific tracepoints if available,
  • host/hypervisor events for vCPU scheduling and virtual interrupts.

BPF as controlled instrumentation

BPF is what I reach for when printk is too blunt, ftrace is too broad, and GDB changes the system too much.

The mechanism is:

  1. Compile a BPF program.
  2. Load it through the bpf syscall.
  3. Let the verifier check safety.
  4. Attach it to tracepoints, kprobes, uprobes, perf events, LSM hooks, or other supported attachment points.
  5. Move results through maps, perf buffers, or BPF ring buffers.

For quick experiments, bpftrace is the fastest path.

List syscall tracepoints:

bpftrace -l 'tracepoint:syscalls:sys_enter_*'

Count syscalls by process:

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

Trace open paths:

bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

Measure read latency:

bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; } kretprobe:vfs_read /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'

Profile kernel stacks:

bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'

Good BPF debugging attaches to request setup and completion, carries a request pointer through a map, then emits latency and operation type. For modern kernels, prefer tracepoints when they exist, because kprobe targets can change across versions. Use kprobes when the internal function is exactly the hypothesis.

Block I/O with stable tracepoints:

bpftrace -e 'tracepoint:block:block_rq_issue { @bytes = hist(args->bytes); }'
bpftrace -e 'tracepoint:block:block_rq_complete { @lat[args->dev] = count(); }'

When writing a reusable tool, use libbpf, BTF, and CO-RE instead of embedding kernel headers at runtime. The running kernel exposes BTF at:

ls -l /sys/kernel/btf/vmlinux
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
bpftool prog list
bpftool map list

CO-RE matters for kernel 6.x and vendor kernels because structure layouts vary. It does not make unstable probe points stable. It only makes type access more portable when the target field still exists and BTF is available.

BPF ring buffers are useful when event order matters. Compared with per-CPU perf buffers, a shared BPF ring buffer can preserve ordering across CPUs for sequential events such as fork, exec, exit, request issue, and completion. That matters for guest debugging because cross-CPU order is often the bug.

Guest BPF constraints

On managed or restricted guests, the available BPF surface depends on build type, kernel config, security policy, and operator restrictions. Treat BPF as a platform engineering tool, not as something always available on production systems.

For debug or engineering builds:

mount -t tracefs nodev /sys/kernel/tracing 2>/dev/null
cat /sys/kernel/tracing/available_events | head
bpftool prog list
cat /sys/kernel/btf/vmlinux >/tmp/vmlinux.btf

For platform bugs:

  • use userspace logs to locate the service or native symptom,
  • use core dumps or crash records for native crashes,
  • use ftrace or BPF for scheduler, IPC, IRQ, block, or driver timing,
  • use pstore/ramoops for reboot and panic evidence,
  • use host or hypervisor traces when the guest cannot observe the event.

IPC and framework behavior can hide behind system calls. A pure syscall trace may not explain why a service is blocked. In that case, combine userspace tags, IPC state, scheduler events, and kernel wait paths.

Hypervisor guest debugging

For a guest OS on a hypervisor, every record should be labeled by layer:

LayerEvidence
Guest userspacejournald, core dumps, service logs, state snapshots
Guest kerneldmesg, pstore, ftrace, BPF, procfs/sysfs
Virtual platformvirtio queues, MMIO traps, injected IRQs, virtual timers
Host or management VMdevice model logs, backend storage/network errors
HypervisorvCPU scheduling, stage-2 faults, traps, reset reason
Hardware or firmwarewatchdog, power, secure monitor, crash dump
Linux userspace journal, core, service logs, snapshots guest kernel printk, pstore, ftrace, BPF, procfs virtual device boundary MMIO, virtqueue, IRQ injection, timers host / management VM backend device, storage, net, console capture hypervisor / firmware vCPU run state, traps, stage-2 faults, reset reason

Common mistake: trusting only the guest. If the guest says "block timeout", that can mean the guest driver is broken, but it can also mean the backend never completed the request, the virtual interrupt was not injected, the vCPU was not scheduled, or the guest clock jumped.

For a failure report, I want at least:

guest dmesg with monotonic timestamps
guest pstore after reboot
core dump or crash record, if native crash
guest /proc/interrupts and /proc/iomem
host/hypervisor event trace covering the same window
virtual device queue or backend logs
reset reason
clocksource and timestamp basis for each log

Playbooks

Boot loop before userspace

Use surfaces that do not require userspace:

earlycon
serial console
dmesg from bootloader or monitor capture
/sys/fs/pstore/console-ramoops-0 after reboot
hypervisor reset reason
host virtual console capture

Questions:

  • Did the kernel start?
  • Did the console start?
  • Did init start?
  • Did the reboot come from watchdog, panic, explicit reset, or hypervisor policy?
  • Is the pstore area valid and not overwritten?

Native service crash

Collect:

journalctl -b --no-pager > journal.txt
coredumpctl list
coredumpctl info <PID-or-COREDUMP>
coredumpctl dump <PID-or-COREDUMP> > core.dump

Then symbolize with matching build artifacts. Do not stop at the last userspace log line. The crash record or core dump is the actual crash artifact.

I/O latency or timeout

Start with count and profile:

perf stat -e task-clock,context-switches,page-faults -a -- sleep 10
cat /proc/interrupts
cat /proc/softirqs

Then trace:

trace-cmd record -e block:block_rq_issue -e block:block_rq_complete -e irq:* sleep 10
bpftrace -e 'tracepoint:block:block_rq_issue { @bytes = hist(args->bytes); }'

In a guest, pair this with host or hypervisor queue events. A guest-side request issue without host-side completion is a boundary problem.

vCPU starvation

Guest symptoms:

  • soft lockup,
  • RCU stall,
  • timer jump,
  • delayed workqueue,
  • long scheduling gap.

Guest evidence:

dmesg | grep -iE 'soft lockup|rcu|watchdog|sched|clocksource'
trace-cmd record -e sched:sched_switch -e irq:* sleep 10
cat /proc/schedstat

Host evidence:

vCPU run/exit trace
physical CPU load
IRQ routing
timer injection
hypervisor scheduling class

RAM corruption

First decide whether the evidence is live or postmortem.

Live:

cat /proc/slabinfo | head
cat /proc/buddyinfo
cat /proc/vmstat | grep -E 'pgfault|oom|compact|allocstall'
cat /sys/kernel/debug/page_owner 2>/dev/null | head

Postmortem:

pstore panic record
vmcore
raw RAM dump
KASAN report
KFENCE report
page_owner snapshot
hypervisor stage-2 fault trace

For raw RAM, preserve the exact kernel image and symbol set. If KASLR is enabled, record the load offset or extract it from the dump. If the platform has reserved regions, include firmware tables or device tree data when relevant, and always keep /proc/iomem with the dump.

Minimal capture sets

Linux guest

uname -a
cat /etc/os-release 2>/dev/null
cat /proc/cmdline
cat /proc/iomem
cat /proc/interrupts
cat /proc/softirqs
dmesg --time-format=iso --decode > dmesg.txt
journalctl -b --no-pager > journal.txt 2>/dev/null
journalctl -k -b --no-pager > journal-kernel.txt 2>/dev/null
ls -l /sys/fs/pstore 2>/dev/null
cp -a /sys/fs/pstore pstore 2>/dev/null || true
findmnt
lsblk -f
coredumpctl list 2>/dev/null

BPF readiness

zcat /proc/config.gz 2>/dev/null | grep -E 'BPF|BTF|KPROBE|FTRACE|TRACEPOINT'
ls -l /sys/kernel/btf/vmlinux
bpftool feature probe kernel
bpftool prog list
bpftool map list
bpftrace -l 'tracepoint:sched:*' | head

pstore readiness

zcat /proc/config.gz 2>/dev/null | grep -E 'PSTORE|RAMOOPS'
cat /proc/iomem | grep -iE 'pstore|ramoops|reserved'
ls -l /sys/fs/pstore
grep -R . /sys/module/ramoops/parameters 2>/dev/null

Reading output well

An output record is not a cause. It is a statement emitted by one layer at one time through one surface.

Read it with these questions:

  1. Which layer emitted this?
  2. Was the layer still healthy when it emitted it?
  3. Is the timestamp guest time, host time, monotonic time, or wall time?
  4. Could earlier records have been overwritten?
  5. Did instrumentation change timing?
  6. Does another layer confirm the same transition?
  7. Is this a symptom after the failure, or the first observable state change?

The right evidence set is the one that can survive the failure mode. If userspace dies, use kernel and RAM-backed records. If the kernel dies, use pstore, serial, hypervisor, or dump. If the guest lies because it is not running, use host and hypervisor traces.

References