linux troubleshooting

Linux troubleshooting as hypothesis management

This note is my troubleshooting stance in a reusable form. Problem solving is not a catalog of fixes. It is a way to move from a symptom to a verified cause without confusing assumptions, observations, tests, workarounds, and fixes.

The companion note "system output" collects the evidence interfaces. The kernel debugging story in the "kernel patch" note is a concrete example of the same method at a lower layer.

The stance

When a Linux system fails, the first important question is not "what command should I run?" The first question is:

What do I actually know?

A symptom can look obvious and still be misleading. A service error might be the first component that noticed the problem, not the cause. A kernel warning might be a secondary effect. A boot hang might be a storage, network, firmware, initramfs, or service ordering problem.

The original note was blunt about this: do not try to memorize every possible incident pattern. Cloud platforms, kernel versions, distributions, firmware, services, and configuration combinations change too quickly. The durable skill is not a list of answers. The durable skill is how to turn noisy evidence into a smaller search space.

Notice the problem

A problem is only operationally real after someone notices it and decides to deal with it. That sounds trivial, but many incidents begin as small signals that were visible earlier:

log volume changes,
latency shape changes,
repeated but ignored warnings,
one host behaving differently from its peers,
boot messages that changed after an update,
a unit that restarts but eventually comes up,
memory pressure that is treated as normal because the service still responds.

The first loop is therefore simple:

notice -> collect -> compare -> decide whether it is abnormal

You need enough understanding of the system to know what "normal" looks like. Without that, every warning becomes either terrifying or invisible.
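As a minimal sketch of the "compare" step, assuming warning counts per host have already been collected somehow, the following flags hosts that deviate from their peers. The host names, the counts, and the factor of 3 are invented for illustration; the right baseline depends on what "normal" means for the system in question.

```python
from statistics import median


def unusual_hosts(warning_counts, factor=3.0):
    """Return hosts whose warning count exceeds `factor` times the peer median."""
    baseline = median(warning_counts.values())
    # With a baseline of zero, any non-zero count is treated as a deviation.
    threshold = baseline * factor
    return sorted(host for host, n in warning_counts.items() if n > threshold)


# Hypothetical counts of warnings per host over the same time window.
counts = {"web1": 12, "web2": 9, "web3": 140, "web4": 11}
suspects = unusual_hosts(counts)
```

The median is used instead of the mean so that a single noisy host does not drag the baseline toward itself.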

Separate observation from interpretation

Write down the difference between facts and interpretations.

Type	Example
observation	`journalctl -b` shows `EXT4-fs error` at 03:12
interpretation	the disk is failing
test	compare SMART data, controller logs, and kernel I/O errors
conclusion	the block device disappeared during a controller reset
workaround	move the workload away from that host
fix	replace hardware, driver, firmware, or configuration

This protects the investigation from circular reasoning. "The disk is broken because the logs show disk errors" is not enough. Which layer produced the error? Did the device vanish? Did the filesystem reject a corrupt structure? Did a timeout occur under load? Did the storage network flap?

Good troubleshooting keeps those questions separate.
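One way to keep the categories from the table apart is to record each note entry with an explicit type. This is only a sketch: the `Kind` and `Entry` names and the rule "a conclusion needs at least one earlier test" are my assumptions, not part of any tool.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    OBSERVATION = "observation"
    INTERPRETATION = "interpretation"
    TEST = "test"
    CONCLUSION = "conclusion"
    WORKAROUND = "workaround"
    FIX = "fix"


@dataclass(frozen=True)
class Entry:
    kind: Kind
    text: str


def unsupported_conclusions(entries):
    """Return conclusions that were recorded before any test entry."""
    seen_test = False
    unsupported = []
    for entry in entries:
        if entry.kind is Kind.TEST:
            seen_test = True
        elif entry.kind is Kind.CONCLUSION and not seen_test:
            unsupported.append(entry)
    return unsupported


# The rows of the table above, in the order they were written down.
notes = [
    Entry(Kind.OBSERVATION, "journalctl -b shows EXT4-fs error at 03:12"),
    Entry(Kind.INTERPRETATION, "the disk is failing"),
    Entry(Kind.TEST, "compare SMART data, controller logs, and kernel I/O errors"),
    Entry(Kind.CONCLUSION, "the block device disappeared during a controller reset"),
]
```

With the entries in this order, `unsupported_conclusions(notes)` returns an empty list. Dropping the test entry would flag the conclusion as unsupported.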

Two phases

The original note divided problem solving into two broad phases:

Investigation: gather data, analyze it, check reproducibility, and identify the cause or a useful boundary.
Execution: apply a workaround or fix, then verify that the observed condition actually changed.

That division is still useful. The failure mode is to jump directly to execution because a command feels familiar. Rebooting, restarting a service, changing a timeout, or deleting a lock file might restore service, but if it erases the only evidence, it also weakens the next investigation.

In production, you may still need the fast workaround. The discipline is to say what you are doing:

This is a recovery action, not a root-cause conclusion.

Three layers of thinking

The original note used three categories:

Strategy: the direction that keeps the investigation moving.
Method: a general way to reduce the problem.
Tools: specific commands, interfaces, and techniques.

For example:

Layer	Example
strategy	find the layer where the failure begins
method	break the boot path into stages
tools	inspect `journalctl -b -1`, kernel command line, and initramfs logs

Methods and tools are useful only when they serve the strategy. Running more commands does not mean the investigation is improving. It might only mean the evidence pile is getting larger.

Direction finding

The most important strategy is direction finding. List the conditions, even the ones that look unrelated.

For a boot issue:

hardware or virtual machine type,
firmware and boot order,
bootloader configuration,
kernel version and command line,
initramfs contents,
root filesystem identity,
storage and network dependencies,
systemd units and ordering,
recent package or configuration changes,
logs from successful and failed boots,
console output,
desired outcome.

Then classify:

assumptions: system state and configuration that are taken as given
observations: what was actually seen
unknowns: what still needs evidence
goal: recovery, root cause, prevention, or proof

The investigation tries to connect assumptions to the goal with evidence.

Work backward from the goal

Sometimes the fastest path is to ask what condition immediately precedes success.

If the goal is "the system must always boot", the immediate condition might be:

The boot path must not block on unavailable NFS mounts.

That does not explain the underlying behavior, but it may produce a safe workaround while the deeper investigation continues. This is not cheating. It is separating service recovery from root-cause analysis.

Experiment deliberately

Technical problems often require experiments, but random changes are expensive. A useful experiment changes one dimension and predicts what should happen.

Examples:

reduce NFS mount entries from four to one,
boot with one kernel version and then another,
remove one kernel command-line option,
reproduce with one service disabled,
compare one failing host with one healthy host,
move from the application layer down to storage or network only after evidence points there.

The rule is:

Before changing something, say what result would support or weaken the hypothesis.
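The rule can be made concrete by writing the prediction down before the change is applied. The sketch below is a hypothetical record type, not an existing tool; the outcome strings are free-form labels I chose for the example.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    hypothesis: str
    change: str
    predicted_if_true: str  # outcome expected if the hypothesis holds

    def evaluate(self, observed: str) -> str:
        """Compare the observed outcome with the prediction made beforehand."""
        if observed == self.predicted_if_true:
            return "supports"
        return "weakens"


experiment = Experiment(
    hypothesis="the boot hang depends on the number of NFS entries",
    change="reduce NFS mount entries from four to one",
    predicted_if_true="boots",
)
verdict = experiment.evaluate("boots")
```

Note that the result is "supports", not "proves". A single matching outcome narrows the search space; it does not close the investigation.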

Useful methods

The original note named several methods. They still hold up.

Max and min

Look at boundaries and extremes:

first failure time,
last known good time,
smallest reproducer,
largest affected set,
most loaded host,
least loaded host,
fastest failure,
slowest failure,
newest package,
oldest kernel.

Extremes reveal shape. Shape is often more useful than volume.
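Two of these boundaries, the last known good time and the first failure time, can be derived mechanically from a timeline of checks. This is a small sketch under the assumption that each event is a `(timestamp, ok)` pair; the timestamps are invented.

```python
from datetime import datetime


def failure_window(events):
    """Return (last_known_good, first_failure) from (timestamp, ok) pairs.

    Either element is None if the timeline has no matching event.
    """
    ordered = sorted(events, key=lambda event: event[0])
    first_failure = next((t for t, ok in ordered if not ok), None)
    if first_failure is None:
        last_good = ordered[-1][0] if ordered else None
        return last_good, None
    good_before = [t for t, ok in ordered if ok and t < first_failure]
    last_good = good_before[-1] if good_before else None
    return last_good, first_failure


# Hypothetical health-check results, deliberately out of order.
timeline = [
    (datetime(2024, 1, 1, 3, 0), True),
    (datetime(2024, 1, 1, 3, 12), False),
    (datetime(2024, 1, 1, 3, 5), True),
    (datetime(2024, 1, 1, 3, 20), True),
]
window = failure_window(timeline)
```

Anything that changed inside that window is a better candidate than anything that changed outside it.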

Specialize and generalize

If many systems fail, find one system that does not. If one system fails, find the smallest configuration that still fails.

Specialization asks:

What exact condition is necessary for the failure?

Generalization asks:

What broader class does this condition belong to?

The NFS example below uses both.
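Finding "the smallest configuration that still fails" can be sketched as a greedy reduction: remove one item at a time and keep the removal only if the failure still reproduces. This is a single greedy pass, not a full delta-debugging algorithm, and it does not guarantee a global minimum. The mount names and the `still_fails` predicate are stand-ins; in practice each call would be a real reproduction attempt such as a reboot.

```python
def reduce_config(items, still_fails):
    """Greedily drop items one at a time while the failure still reproduces."""
    current = list(items)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if still_fails(candidate):
            current = candidate  # item i was not needed; retry the same index
        else:
            i += 1  # item i is needed for the failure; keep it
    return current


# Hypothetical mount entries; the predicate models "fails with two or more".
mounts = ["nfs-a", "nfs-b", "nfs-c", "nfs-d"]
smallest = reduce_config(mounts, lambda config: len(config) >= 2)
```

Which two entries survive is an accident of the removal order. The useful result is the size of the remaining set, which is the specialization step; asking why two is the threshold is the generalization step.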

Break down the path

Do not troubleshoot "Linux" as one object. Split the path:

firmware -> bootloader -> kernel -> initramfs -> root fs -> init -> services -> application

For a network service:

client -> DNS -> routing -> firewall -> listener -> service -> dependency -> storage

Each arrow is a place to collect evidence.
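A sketch of walking such a path in order, assuming there is one check per stage. The check functions here are placeholders that return fixed values; real checks would resolve a name, open a socket, and so on. Stages without a check are reported as unknown instead of being assumed healthy.

```python
def first_failing_stage(stages, checks):
    """Return (first stage whose check fails or None, stages with no check)."""
    unknown = []
    for stage in stages:
        check = checks.get(stage)
        if check is None:
            unknown.append(stage)
            continue
        if not check():
            return stage, unknown
    return None, unknown


path = ["client", "DNS", "routing", "firewall",
        "listener", "service", "dependency", "storage"]

# Placeholder checks with fixed results, for illustration only.
checks = {
    "client": lambda: True,
    "DNS": lambda: True,
    "routing": lambda: True,
    "listener": lambda: False,
}
failing, unverified = first_failing_stage(path, checks)
```

Here the listener is the first stage with failing evidence, but the firewall was never checked, so it remains a candidate too.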

Use invariants

Some things should not change:

a monotonic clock should not go backward,
a process should have the expected parent,
a mount should point to the expected device,
a service should run with the expected environment,
a route should match the intended interface,
a kernel should boot with the expected command line.

An invariant violation is often more useful than a direct error message.
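Two of these invariants are easy to check mechanically. The sketch below assumes the kernel command line has been read as a string (for example from `/proc/cmdline`) and that clock readings were sampled in order; the sample values are invented. Splitting on whitespace is a simplification that ignores quoted values containing spaces.

```python
def missing_cmdline_options(cmdline, expected):
    """Return expected kernel command-line options absent from `cmdline`."""
    present = set(cmdline.split())
    return [option for option in expected if option not in present]


def backward_steps(samples):
    """Return (previous, current) pairs where a monotonic reading decreased."""
    return [(a, b) for a, b in zip(samples, samples[1:]) if b < a]


# Invented examples of a command line and a series of clock readings.
cmdline = "BOOT_IMAGE=/vmlinuz root=UUID=abcd ro quiet"
missing = missing_cmdline_options(cmdline, ["ro", "console=ttyS0"])
steps = backward_steps([1.0, 2.0, 1.5, 3.0])
```

An empty result from either function means the invariant held for the data that was sampled, not that it always holds.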

Move dimensions

Move up and down the stack. If application logs are vague, inspect system logs. If system logs point to storage, inspect block devices. If block devices look fine, check the hypervisor or hardware. If the kernel path is unclear, instrument it.

This is the link to the "kernel patch" note. Kernel debugging is the same reasoning under lower-level constraints.

Example: intermittent boot failure

The original note used a deliberately simple example: "The OS sometimes does not boot."

That report is too vague. It says the machine becomes unavailable, but it does not say where the boot stops.

A better investigation starts by listing facts:

The system was not under high CPU, disk, or network load before reboot.
Persistent journal logs were not available for the failed boot.
The console showed `BUG: soft lockup` and repeated call traces.
The instruction pointer in the trace pointed around `nfs4_state_manager`.
There were also RCU stall messages.
The system had one root filesystem entry and several NFS entries in `/etc/fstab`.
The desired outcome was reliable boot.

The early hypothesis became:

The boot hang is related to NFS state handling, not generic load.

That was still broad, but it was better than "Linux does not boot."

Experiment: number of NFS entries

The experiment changed the number of NFS mount entries:

NFS entries	Reproduced
1	no
4	yes
2	yes

This did not prove the root cause. It did produce a very useful boundary:

Multiple NFS entries are necessary for this reproduction.

That immediately gives a workaround:

Do not put two or more affected NFS entries in the boot path.

It also gives better search terms and a better upstream bug search:

soft lockup nfs4_state_manager multiple nfs fstab boot

In the original story, the issue had already been fixed by a kernel update. The important lesson was not "NFS is bad." The lesson was that a single missed condition, the number of NFS entries, became visible only after a controlled experiment.
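The results table from this experiment can be reduced to a boundary with a few lines. This is a sketch that assumes the results are keyed by the number of NFS entries; the function name is mine.

```python
def reproduction_boundary(results):
    """Return (largest count that did not reproduce, smallest count that did)."""
    clean = [n for n, reproduced in results.items() if not reproduced]
    failing = [n for n, reproduced in results.items() if reproduced]
    return (
        max(clean) if clean else None,
        min(failing) if failing else None,
    )


# NFS entries -> reproduced, in the order the experiments were run.
results = {1: False, 4: True, 2: True}
boundary = reproduction_boundary(results)
```

For the three runs above the boundary is `(1, 2)`: one entry did not reproduce, two entries did. If the first number were ever greater than or equal to the second, the number of entries alone would not explain the failure.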

Evidence bundle

For many Linux incidents, a first evidence bundle can be small:

uname -a
cat /etc/os-release
cat /proc/cmdline
journalctl -b -p warning --no-pager
dmesg -T
systemctl --failed
lsblk -f
findmnt
ip addr
ip route

For service issues:

systemctl status name.service
systemctl cat name.service
journalctl -u name.service -b --no-pager
systemctl list-dependencies name.service

For storage and filesystem issues:

lsblk -o NAME,MAJ:MIN,SIZE,FSTYPE,FSVER,MOUNTPOINTS,UUID
findmnt -R /
dmesg -T | grep -iE 'error|fail|timeout|reset|nvme|scsi|blk|ext4|xfs'

For timekeeping or clocksource issues:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /proc/cpuinfo
dmesg | grep -iE 'clocksource|tsc|hpet|acpi_pm|kvm-clock|xen'

That last bundle is exactly where virtualized time and kernel patch meet.
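A bundle like the ones above is more useful when it is captured in one pass and a missing tool does not stop the rest. The following is a minimal sketch of such a collector, not an established tool: the `collect` function and its return shape are my own choices, and the command list is simply the first bundle from this section. Some commands, such as `dmesg`, may need elevated privileges on some systems.

```python
import shlex
import subprocess

FIRST_BUNDLE = [
    "uname -a",
    "cat /etc/os-release",
    "cat /proc/cmdline",
    "journalctl -b -p warning --no-pager",
    "dmesg -T",
    "systemctl --failed",
    "lsblk -f",
    "findmnt",
    "ip addr",
    "ip route",
]


def collect(commands, timeout=30):
    """Run each command and return {command: (returncode, output)}.

    A command that cannot be started or that times out is recorded
    with a returncode of None instead of aborting the whole bundle.
    """
    results = {}
    for command in commands:
        try:
            proc = subprocess.run(
                shlex.split(command),
                capture_output=True,
                text=True,
                errors="replace",  # keep going if the output is not valid text
                timeout=timeout,
            )
            results[command] = (proc.returncode, proc.stdout + proc.stderr)
        except OSError as exc:
            results[command] = (None, f"could not run: {exc}")
        except subprocess.TimeoutExpired:
            results[command] = (None, f"timed out after {timeout}s")
    return results
```

Calling `collect(FIRST_BUNDLE)` on the affected host and on a healthy peer gives two dictionaries that can be compared key by key. The commands are split with `shlex` and run without a shell, so pipelines such as the `dmesg | grep` lines would need to be handled separately.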

How to read logs

Log reading should not be passive. Ask these questions:

What is the first abnormal message, not the loudest one?
What component produced it?
Is the message a cause, a consequence, or a reporter?
What happened immediately before it?
Does the same message appear on healthy systems?
Did time jump, did the machine reboot, or did log persistence begin late?

The first abnormal message is often earlier than the error someone singled out for the report. This is why persistent logs and console capture matter.
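The first question can be partly automated. The sketch below assumes the input is the line-per-entry JSON that `journalctl -b -o json --no-pager` produces, where `PRIORITY` and `__REALTIME_TIMESTAMP` are strings; entries where those fields are missing or not simple strings are skipped. The sample messages in the usage are invented.

```python
import json


def first_abnormal(json_lines, max_priority=4):
    """Return the earliest journal entry at warning level or worse, or None.

    Syslog priorities run from 0 (emergency) to 7 (debug); 4 is warning.
    """
    earliest = None
    for line in json_lines:
        entry = json.loads(line)
        priority = entry.get("PRIORITY")
        stamp = entry.get("__REALTIME_TIMESTAMP")
        if not (isinstance(priority, str) and priority.isdigit()):
            continue
        if not (isinstance(stamp, str) and stamp.isdigit()):
            continue
        if int(priority) > max_priority:
            continue
        if earliest is None or int(stamp) < earliest[0]:
            earliest = (int(stamp), entry)
    return earliest[1] if earliest else None


# Invented entries: the loud error arrives after a quieter warning.
sample = [
    json.dumps({"__REALTIME_TIMESTAMP": "1700000300000000",
                "PRIORITY": "3", "MESSAGE": "late error"}),
    json.dumps({"__REALTIME_TIMESTAMP": "1700000100000000",
                "PRIORITY": "6", "MESSAGE": "routine info"}),
    json.dumps({"__REALTIME_TIMESTAMP": "1700000200000000",
                "PRIORITY": "4", "MESSAGE": "early warning"}),
]
first = first_abnormal(sample)
```

This only finds the earliest message by severity. Whether that message is a cause, a consequence, or a reporter is still a judgment the operator has to make.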

When to stop digging

Not every incident deserves full root-cause analysis. A useful conclusion can be:

We do not yet know the source-level root cause, but we know the necessary condition and have removed it from production.

That is valid if it includes residual risk and a follow-up trigger.

A weak conclusion is:

The server was broken, so we rebooted it.

A better conclusion is:

The machine hung during boot when two or more affected NFS mounts were present in `/etc/fstab`. Console output showed repeated `nfs4_state_manager` soft lockups. Reducing the entries to one removed the reproduction, and the issue had already been fixed upstream in a later kernel update.

The difference is that the second version can guide the next operator.
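A conclusion can be checked for completeness before it is filed. This sketch assumes a conclusion is kept as a dictionary; the field names are my own and only mirror what this section asks for: the condition, the evidence, the action taken, the residual risk, and a follow-up trigger.

```python
REQUIRED = ("condition", "evidence", "action", "residual_risk", "follow_up_trigger")


def missing_parts(conclusion):
    """Return the required parts that are absent or blank in a conclusion dict."""
    missing = []
    for key in REQUIRED:
        value = conclusion.get(key)
        if not (isinstance(value, str) and value.strip()):
            missing.append(key)
    return missing


weak = {"action": "rebooted the server"}

better = {
    "condition": "two or more affected NFS mounts in /etc/fstab",
    "evidence": "repeated nfs4_state_manager soft lockups on the console",
    "action": "reduced the NFS entries and updated the kernel",
}
```

Here `missing_parts(weak)` lists everything except the action, while `missing_parts(better)` reports only the residual risk and the follow-up trigger as still to be written.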

Modern adjustments

Some old commands and habits still work, but current systems often have better interfaces:

Prefer `journalctl -k -b` for kernel logs when journald has captured them.
Use `journalctl --list-boots` before assuming previous boot logs exist.
Use `systemctl cat` and dependency queries instead of guessing unit files.
Use `findmnt` and `lsblk` instead of parsing only `/etc/fstab`.
Use tracefs, dynamic debug, and BPF-based tools when they fit the failure path.
Keep console capture for early boot and panic paths, because journald may start too late.

The principle does not change: collect evidence that answers a specific hypothesis.

Summary

Linux troubleshooting is not a bag of commands. It is a discipline:

Notice the abnormal state.
Separate facts from interpretations.
Choose the layer to investigate.
Gather evidence that can change the hypothesis.
Run controlled experiments.
Apply the workaround or fix.
Verify that the original observation changed.
Write the conclusion in a way the next operator can reuse.

That is why this note lives between the "system output" and "kernel patch" notes. The first explains where evidence comes from. The second shows the same reasoning inside kernel source.
