vmm.dev

kernel patch

kernel patchkernel patchvirtualiz...virtualized timesystem ou...system outputLinuxLinuxlinux tro...linux troubleshootingシステム出力システム出力vmm.devvmm.dev

kernel patch

Chasing a missing TSC clocksource in a KVM guest

I originally wrote this after chasing a small kernel timing problem: a CentOS 7 guest running on KVM did not list tsc in available_clocksource, while a comparable guest on Xen did.

The interesting part was not the final patch. The interesting part was the path: compare two systems, read the boot messages, instrument the kernel with printk, build a distribution kernel, and then discover that upstream had already fixed the same class of problem more cleanly.

The background note is virtualized time. The evidence note is system output.

Symptom

On the KVM guest, the clocksource list looked like this:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock acpi_pm

The missing entry was tsc. The kernel did see the processor TSC as a device-level feature:

$ dmesg | grep -iE '(clocksource|tsc)'
[    0.000000] tsc: Detected 3000.000 MHz processor
[    1.834422] TSC deadline timer enabled
[    2.600920] Switched to clocksource kvm-clock

That output suggested a narrow problem. The TSC itself existed. The guest selected kvm-clock as the active clocksource. What failed was registration of tsc as an available clocksource.

Comparison with Xen

The first useful step was not editing the kernel. It was comparing a similar system where the symptom did not appear.

On a Xen guest:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
xen tsc hpet acpi_pm

$ dmesg | grep -iE '(clocksource|tsc)'
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2394.486 MHz processor
[    0.031000] tsc: Detected 2394.476 MHz TSC
[    0.607022] Switched to clocksource xen
[    2.244080] tsc: Refined TSC clocksource calibration: 2394.455 MHz

The important difference was the refined calibration message. The KVM guest reported that it detected a processor TSC, but it never reached the later message that says the TSC clocksource calibration was refined and registered.

At this point the hypothesis was still weak:

The TSC is present, but the delayed refinement path is not completing.

That is a useful hypothesis because it points to a small part of the x86 timekeeping code.

Source preparation

The old environment was CentOS 7 with a distribution kernel around 3.10.0-957.el7. The source preparation path used RPM tooling:

sudo yumdownloader --source kernel-3.10.0-957.el7
sudo yum groupinstall "Development Tools" -y
sudo yum install rpmdevtools -y
rpmdev-setuptree
rpm -Uvh kernel-3.10.0-957.el7.src.rpm
sudo yum-builddep -y --enablerepo=* rpmbuild/SPECS/kernel.spec
rpmbuild -bp ~/rpmbuild/SPECS/kernel.spec

The source tree was copied twice so the patch could be generated as a clean diff:

cp -r ~/rpmbuild/BUILD/kernel-3.10.0-957.el7 \
  ~/rpmbuild/BUILD/kernel-3.10.0-957.el7.orig

cp -al ~/rpmbuild/BUILD/kernel-3.10.0-957.el7.orig \
  ~/rpmbuild/BUILD/kernel-3.10.0-957.el7.new

The files that mattered were the usual x86 timekeeping files:

arch/x86/kernel/tsc.c
arch/x86/kernel/kvmclock.c
arch/x86/kernel/pvclock.c
kernel/time/clocksource.c

Modern note: for current kernels, I would first check tracepoints, dynamic debug, ftrace, and BPF-capable tracing before rebuilding the kernel. But for an early boot path in an older distribution kernel, small printk markers are still a reasonable baseline.

A small printk marker

The investigation used a deliberately simple macro:

#define __dprintk(n) \
    printk(KERN_INFO "No.%s, func: %s, line: %d, file: %s\n", \
           #n, __FUNCTION__, __LINE__, __FILE__)

The point is not elegance. The point is to make boot logs answer a direct question:

Which branches did this kernel actually execute?

That matters because boot-time clocksource registration happens before normal user-space observability is available. The log becomes the debugger.

Entry point

In the old kernel, TSC clocksource registration begins in init_tsc_clocksource:

static int __init init_tsc_clocksource(void)
{
    if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
        return 0;

    if (tsc_clocksource_reliable)
        clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;

    if (check_tsc_unstable()) {
        clocksource_tsc.rating = 0;
        clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
    }

    if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
        clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP;

    if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
        clocksource_register_khz(&clocksource_tsc, tsc_khz);
        return 0;
    }

    schedule_delayed_work(&tsc_irqwork, 0);
    return 0;
}

device_initcall(init_tsc_clocksource);

There are two obvious paths:

  1. If the TSC is already trusted, register it immediately.
  2. Otherwise, schedule delayed work that refines the calibration against HPET or ACPI PM.

The KVM guest was not taking the immediate path. The delayed work became the next target.

Marking the path

The markers were inserted around the branches:

static int __init init_tsc_clocksource(void)
{
    if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
        return 0;
__dprintk(1);

    if (tsc_clocksource_reliable)
        clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;

    if (check_tsc_unstable()) {
        clocksource_tsc.rating = 0;
        clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
    }

    if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
        clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP;

    if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
        clocksource_register_khz(&clocksource_tsc, tsc_khz);
        return 0;
    }
__dprintk(2);
    schedule_delayed_work(&tsc_irqwork, 0);
__dprintk(3);
    return 0;
}

Then the delayed work was marked:

static DECLARE_DELAYED_WORK(tsc_irqwork, tsc_refine_calibration_work);

static void tsc_refine_calibration_work(struct work_struct *work)
{
    static u64 tsc_start = -1, ref_start;
    static int hpet;
    u64 tsc_stop, ref_stop, delta;
    unsigned long freq;

    if (check_tsc_unstable())
        goto out;
__dprintk(4);

    if (tsc_start == -1) {
__dprintk(5);
        hpet = is_hpet_enabled();
        schedule_delayed_work(&tsc_irqwork, HZ);
        tsc_start = tsc_read_refs(&ref_start, hpet);
        return;
    }
__dprintk(6);

    tsc_stop = tsc_read_refs(&ref_stop, hpet);

    if (ref_start == ref_stop)
        goto out;
__dprintk(7);

    if (tsc_start == ULLONG_MAX || tsc_stop == ULLONG_MAX)
        goto out;
__dprintk(8);

    delta = tsc_stop - tsc_start;
    delta *= 1000000LL;
    if (hpet)
        freq = calc_hpet_ref(delta, ref_start, ref_stop);
    else
        freq = calc_pmtimer_ref(delta, ref_start, ref_stop);

    if (abs(tsc_khz - freq) > tsc_khz / 100)
        goto out;
__dprintk(9);

    tsc_khz = freq;
    pr_info("Refined TSC clocksource calibration: %lu.%03lu MHz\n",
            (unsigned long)tsc_khz / 1000,
            (unsigned long)tsc_khz % 1000);

out:
    if (boot_cpu_has(X86_FEATURE_ART))
        art_related_clocksource = &clocksource_tsc;
__dprintk(10);
    clocksource_register_khz(&clocksource_tsc, tsc_khz);
}

This is noisy and ugly. It is also effective. The markers form a tiny execution trace.

Building the test kernel

The edited tree was turned into a distribution patch:

cd ~/rpmbuild/BUILD
diff -uNrp kernel-3.10.0-957.el7.orig \
  kernel-3.10.0-957.el7.new \
  > ../SOURCES/linux-3.10.0-957.el7.patch

Then the kernel spec was adjusted to apply the patch and give the build a visible ID:

%define buildid .tscheck
Patch1000: linux-3.10.0-957.el7.patch
ApplyOptionalPatch linux-3.10.0-957.el7.patch

The build/install step:

rpmbuild -bb --with baseonly \
  --without debuginfo \
  --without debug \
  --without doc \
  --without perf \
  --without tools \
  --without kdump \
  --without bootwrapper \
  --target=$(uname -m) kernel.spec

sudo yum localinstall -y ~/rpmbuild/RPMS/x86_64/kernel-*.rpm

This is heavy-handed, but it preserves the distribution kernel packaging shape.

What printk showed

The instrumented kernel produced the useful clue:

$ dmesg | grep -iE '(clocksource|tsc|No\\.)'
[    0.000000] Linux version 3.10.0-957.el7.tscheck.x86_64 ...
[    0.000000] tsc: Detected 3000.000 MHz processor
[    1.804286] TSC deadline timer enabled
[    2.557401] Switched to clocksource kvm-clock
[    3.035763] No.1, func: init_tsc_clocksource, line: 1309, file: arch/x86/kernel/tsc.c
[    3.044996] No.2, func: init_tsc_clocksource, line: 1330, file: arch/x86/kernel/tsc.c
[    3.054436] No.3, func: init_tsc_clocksource, line: 1332, file: arch/x86/kernel/tsc.c
[    3.063727] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[    3.073240] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[    4.083424] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[    4.092902] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[    5.085423] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[    5.085424] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c

The trace repeated No.4 and No.5 forever. It never reached No.6.

That means the work function kept re-entering this block:

if (tsc_start == -1) {
    hpet = is_hpet_enabled();
    schedule_delayed_work(&tsc_irqwork, HZ);
    tsc_start = tsc_read_refs(&ref_start, hpet);
    return;
}

The real question became:

Why does tsc_start keep behaving like -1?

tsc_read_refs

The answer was in tsc_read_refs:

#define MAX_RETRIES     5
#define SMI_TRESHOLD    50000

static u64 tsc_read_refs(u64 *p, int hpet)
{
    u64 t1, t2;
    int i;

    for (i = 0; i < MAX_RETRIES; i++) {
        t1 = get_cycles();
        if (hpet)
            *p = hpet_readl(HPET_COUNTER) & 0xFFFFFFFF;
        else
            *p = acpi_pm_read_early();
        t2 = get_cycles();
        if ((t2 - t1) < SMI_TRESHOLD)
            return t2;
    }
    return ULLONG_MAX;
}

tsc_start was not literally a normal -1 state. The function was returning ULLONG_MAX, which is the all-ones value. The code used that value as a failure sentinel.

The intended operation is:

  1. Read the TSC into t1.
  2. Read a reference counter, either HPET or ACPI PM.
  3. Read the TSC again into t2.
  4. If the read interval was short enough, accept t2.
  5. Otherwise, retry a few times and return failure.

The fixed threshold was the weak point. If the TSC count advances too much while the reference counter is being read, every retry fails. In a virtualized environment, that can happen because the guest-observed I/O path to an emulated reference counter can be slow relative to TSC.

Local cause

The local cause was:

The TSC/reference-counter sample was rejected every time.
The delayed calibration work kept rescheduling itself.
The final TSC clocksource registration path never completed.
No useful kernel message explained that this was happening.

That matched the original symptom exactly: tsc existed as hardware, but not as an available Linux clocksource.

Patch experiment 1: raise the threshold

The fastest experiment was crude:

-#define SMI_TRESHOLD    50000
+#define SMI_TRESHOLD    5000000

It was not a good patch, but it was a good experiment. The result:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc acpi_pm

$ dmesg | grep -iE '(clocksource|tsc|No\\.)'
[    0.000000] tsc: Detected 3000.000 MHz processor
[    1.835560] TSC deadline timer enabled
[    2.605330] Switched to clocksource kvm-clock
[    3.114963] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[    3.124533] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[    4.219336] No.6, func: tsc_refine_calibration_work, line: 1266, file: arch/x86/kernel/tsc.c
[    4.229233] No.7, func: tsc_refine_calibration_work, line: 1273, file: arch/x86/kernel/tsc.c
[    4.239024] No.8, func: tsc_refine_calibration_work, line: 1278, file: arch/x86/kernel/tsc.c
[    4.248844] No.9, func: tsc_refine_calibration_work, line: 1290, file: arch/x86/kernel/tsc.c
[    4.258687] tsc: Refined TSC clocksource calibration: 3000.004 MHz
[    4.264562] No.10, func: tsc_refine_calibration_work, line: 1300, file: arch/x86/kernel/tsc.c

The experiment proved the suspected branch. It did not prove that 5000000 was a safe value. A larger fixed threshold can hide real disturbances, and clocksource code is exactly the wrong place to hide real timing problems.

Patch experiment 2: trust KVM frequency information

The other experiment was KVM-specific. The KVM paravirtual clock path already knows a TSC frequency through pvclock data:

static unsigned long kvm_get_tsc_khz(void)
{
    struct pvclock_vcpu_time_info *src;
    int cpu;
    unsigned long tsc_khz;

    src = &hv_clock[cpu].pvti;
    tsc_khz = pvclock_tsc_khz(src);
    preempt_enable();
    return tsc_khz;
}

void __init kvmclock_init(void)
{
    x86_platform.calibrate_tsc = kvm_get_tsc_khz;
    x86_platform.calibrate_cpu = kvm_get_tsc_khz;
}

The experiment was to force the TSC-reliable capability when KVM clock initialization provided the calibration path:

@@ -338,6 +338,7 @@ void __init kvmclock_init(void)

     x86_platform.calibrate_tsc = kvm_get_tsc_khz;
     x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+    setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
     x86_platform.get_wallclock = kvm_get_wallclock;
     x86_platform.set_wallclock = kvm_set_wallclock;

The test result:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc acpi_pm

$ dmesg | grep -iE '(clocksource|tsc)'
[    0.000000] tsc: Detected 3000.000 MHz processor
[    1.832686] TSC deadline timer enabled
[    1.929653] Skipped synchronization checks as TSC is reliable.
[    2.598602] Switched to clocksource kvm-clock

Again, the experiment proved a direction. It did not automatically make the patch correct for upstream or for every virtualization setup.

Upstream had the better patches

Distribution kernels are often older than upstream. After the local cause was understood, the right move was to search upstream history.

The threshold idea had a much better upstream version:

x86/tsc: Make calibration refinement more robust

The important shape was:

-#define MAX_RETRIES     5
-#define SMI_TRESHOLD    50000
+#define MAX_RETRIES         5
+#define TSC_DEFAULT_THRESHOLD 0x20000

 static u64 tsc_read_refs(u64 *p, int hpet)
 {
     u64 t1, t2;
+    u64 thresh = tsc_khz ? tsc_khz >> 5 : TSC_DEFAULT_THRESHOLD;
     int i;

     for (i = 0; i < MAX_RETRIES; i++) {
         t1 = get_cycles();
         ...
         t2 = get_cycles();
-        if ((t2 - t1) < SMI_TRESHOLD)
+        if ((t2 - t1) < thresh)
             return t2;
     }
     return ULLONG_MAX;
 }

This is the difference between a local proof and a real fix. The threshold scales with the calibrated TSC frequency instead of pretending one fixed number fits every system.

There was also a KVM-related upstream direction:

kvmclock: fix TSC calibration for nested guests

The relevant shape:

 static unsigned long kvm_get_tsc_khz(void)
 {
     src = &hv_clock[cpu].pvti;
     tsc_khz = pvclock_tsc_khz(src);
     put_cpu();
+    setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
     return tsc_khz;
 }

The upstream capability name and exact conditions differ from the local experiment, which is exactly the point. Kernel patches carry architecture, migration, nested virtualization, CPU feature, and regression concerns that are easy to miss from one local reproduction.

What changed since the original note

The original note was centered on printk and source reading. That is still useful, but a current workflow should check less invasive tools first when the code path permits it.

Useful modern tools and interfaces include:

  • dynamic debug for enabling pr_debug() and dev_dbg() callsites at runtime,
  • ftrace and tracefs for function tracing and event tracing,
  • boot-time tracing through bootconfig when the interesting path happens before normal user space,
  • BPF tracing when a stable attachment point exists and the system policy allows it,
  • and distribution debuginfo plus crash dumps when a failure produces a dumpable state.

For this specific old failure, printk was still a good fit because the suspect path ran early, the branch was simple, and the question was binary: does the delayed work get past this block?

Lessons

The durable lessons are not CentOS 7-specific:

  1. Compare a failing system with a similar healthy one before patching.
  2. Treat missing log output as evidence. Silent paths are often where the bug is.
  3. Instrument branches, not feelings.
  4. Separate a patch that proves a hypothesis from a patch that should ship.
  5. Search upstream after you understand the failure well enough to search for the right words.
  6. Do not trust one virtualized timing result as a universal timing rule.

The story ends in a humble place: the local patch experiments were useful, but upstream had the cleaner fixes. That is a good outcome. Kernel debugging is not only about inventing patches; it is also about learning enough to recognize the right patch when you find it.

References

See also

Related: Linux, virtualized time, system output, linux troubleshooting.