kernel patch
Chasing a missing TSC clocksource in a KVM guest
I originally wrote this after chasing a small kernel timing problem: a CentOS 7 guest running on KVM did not list tsc in available_clocksource, while a comparable guest on Xen did.
The interesting part was not the final patch. The interesting part was the path: compare two systems, read the boot messages, instrument the kernel with printk, build a distribution kernel, and then discover that upstream had already fixed the same class of problem more cleanly.
The background note is virtualized time. The evidence note is system output.
Symptom
On the KVM guest, the clocksource list looked like this:
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock acpi_pm
The missing entry was tsc. The kernel did see the processor TSC as a device-level feature:
$ dmesg | grep -iE '(clocksource|tsc)'
[ 0.000000] tsc: Detected 3000.000 MHz processor
[ 1.834422] TSC deadline timer enabled
[ 2.600920] Switched to clocksource kvm-clock
That output suggested a narrow problem. The TSC itself existed. The guest selected kvm-clock as the active clocksource. What failed was registration of tsc as an available clocksource.
Comparison with Xen
The first useful step was not editing the kernel. It was comparing a similar system where the symptom did not appear.
On a Xen guest:
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
xen tsc hpet acpi_pm
$ dmesg | grep -iE '(clocksource|tsc)'
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 2394.486 MHz processor
[ 0.031000] tsc: Detected 2394.476 MHz TSC
[ 0.607022] Switched to clocksource xen
[ 2.244080] tsc: Refined TSC clocksource calibration: 2394.455 MHz
The important difference was the refined calibration message. The KVM guest reported that it detected a processor TSC, but it never reached the later message that says the TSC clocksource calibration was refined and registered.
At this point the hypothesis was still weak:
The TSC is present, but the delayed refinement path is not completing.
That is a useful hypothesis because it points to a small part of the x86 timekeeping code.
Source preparation
The old environment was CentOS 7 with a distribution kernel around 3.10.0-957.el7. The source preparation path used RPM tooling:
sudo yumdownloader --source kernel-3.10.0-957.el7
sudo yum groupinstall "Development Tools" -y
sudo yum install rpmdevtools -y
rpmdev-setuptree
rpm -Uvh kernel-3.10.0-957.el7.src.rpm
sudo yum-builddep -y --enablerepo=* rpmbuild/SPECS/kernel.spec
rpmbuild -bp ~/rpmbuild/SPECS/kernel.spec
The source tree was copied twice so the patch could be generated as a clean diff:
cp -r ~/rpmbuild/BUILD/kernel-3.10.0-957.el7 \
~/rpmbuild/BUILD/kernel-3.10.0-957.el7.orig
cp -al ~/rpmbuild/BUILD/kernel-3.10.0-957.el7.orig \
~/rpmbuild/BUILD/kernel-3.10.0-957.el7.new
The files that mattered were the usual x86 timekeeping files:
arch/x86/kernel/tsc.c
arch/x86/kernel/kvmclock.c
arch/x86/kernel/pvclock.c
kernel/time/clocksource.c
Modern note: for current kernels, I would first check tracepoints, dynamic debug, ftrace, and BPF-capable tracing before rebuilding the kernel. But for an early boot path in an older distribution kernel, small printk markers are still a reasonable baseline.
A small printk marker
The investigation used a deliberately simple macro:
#define __dprintk(n) \
printk(KERN_INFO "No.%s, func: %s, line: %d, file: %s\n", \
#n, __FUNCTION__, __LINE__, __FILE__)
The point is not elegance. The point is to make boot logs answer a direct question:
Which branches did this kernel actually execute?
That matters because boot-time clocksource registration happens before normal user-space observability is available. The log becomes the debugger.
Entry point
In the old kernel, TSC clocksource registration begins in init_tsc_clocksource:
static int __init init_tsc_clocksource(void)
{
if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
return 0;
if (tsc_clocksource_reliable)
clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
if (check_tsc_unstable()) {
clocksource_tsc.rating = 0;
clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
}
if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP;
if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
clocksource_register_khz(&clocksource_tsc, tsc_khz);
return 0;
}
schedule_delayed_work(&tsc_irqwork, 0);
return 0;
}
device_initcall(init_tsc_clocksource);
There are two obvious paths:
- If the TSC is already trusted, register it immediately.
- Otherwise, schedule delayed work that refines the calibration against HPET or ACPI PM.
The KVM guest was not taking the immediate path. The delayed work became the next target.
Marking the path
The markers were inserted around the branches:
static int __init init_tsc_clocksource(void)
{
if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
return 0;
__dprintk(1);
if (tsc_clocksource_reliable)
clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
if (check_tsc_unstable()) {
clocksource_tsc.rating = 0;
clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
}
if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP;
if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
clocksource_register_khz(&clocksource_tsc, tsc_khz);
return 0;
}
__dprintk(2);
schedule_delayed_work(&tsc_irqwork, 0);
__dprintk(3);
return 0;
}
Then the delayed work was marked:
static DECLARE_DELAYED_WORK(tsc_irqwork, tsc_refine_calibration_work);
static void tsc_refine_calibration_work(struct work_struct *work)
{
static u64 tsc_start = -1, ref_start;
static int hpet;
u64 tsc_stop, ref_stop, delta;
unsigned long freq;
if (check_tsc_unstable())
goto out;
__dprintk(4);
if (tsc_start == -1) {
__dprintk(5);
hpet = is_hpet_enabled();
schedule_delayed_work(&tsc_irqwork, HZ);
tsc_start = tsc_read_refs(&ref_start, hpet);
return;
}
__dprintk(6);
tsc_stop = tsc_read_refs(&ref_stop, hpet);
if (ref_start == ref_stop)
goto out;
__dprintk(7);
if (tsc_start == ULLONG_MAX || tsc_stop == ULLONG_MAX)
goto out;
__dprintk(8);
delta = tsc_stop - tsc_start;
delta *= 1000000LL;
if (hpet)
freq = calc_hpet_ref(delta, ref_start, ref_stop);
else
freq = calc_pmtimer_ref(delta, ref_start, ref_stop);
if (abs(tsc_khz - freq) > tsc_khz / 100)
goto out;
__dprintk(9);
tsc_khz = freq;
pr_info("Refined TSC clocksource calibration: %lu.%03lu MHz\n",
(unsigned long)tsc_khz / 1000,
(unsigned long)tsc_khz % 1000);
out:
if (boot_cpu_has(X86_FEATURE_ART))
art_related_clocksource = &clocksource_tsc;
__dprintk(10);
clocksource_register_khz(&clocksource_tsc, tsc_khz);
}
This is noisy and ugly. It is also effective. The markers form a tiny execution trace.
Building the test kernel
The edited tree was turned into a distribution patch:
cd ~/rpmbuild/BUILD
diff -uNrp kernel-3.10.0-957.el7.orig \
kernel-3.10.0-957.el7.new \
> ../SOURCES/linux-3.10.0-957.el7.patch
Then the kernel spec was adjusted to apply the patch and give the build a visible ID:
%define buildid .tscheck
Patch1000: linux-3.10.0-957.el7.patch
ApplyOptionalPatch linux-3.10.0-957.el7.patch
The build/install step:
rpmbuild -bb --with baseonly \
--without debuginfo \
--without debug \
--without doc \
--without perf \
--without tools \
--without kdump \
--without bootwrapper \
--target=$(uname -m) kernel.spec
sudo yum localinstall -y ~/rpmbuild/RPMS/x86_64/kernel-*.rpm
This is heavy-handed, but it preserves the distribution kernel packaging shape.
What printk showed
The instrumented kernel produced the useful clue:
$ dmesg | grep -iE '(clocksource|tsc|No\\.)'
[ 0.000000] Linux version 3.10.0-957.el7.tscheck.x86_64 ...
[ 0.000000] tsc: Detected 3000.000 MHz processor
[ 1.804286] TSC deadline timer enabled
[ 2.557401] Switched to clocksource kvm-clock
[ 3.035763] No.1, func: init_tsc_clocksource, line: 1309, file: arch/x86/kernel/tsc.c
[ 3.044996] No.2, func: init_tsc_clocksource, line: 1330, file: arch/x86/kernel/tsc.c
[ 3.054436] No.3, func: init_tsc_clocksource, line: 1332, file: arch/x86/kernel/tsc.c
[ 3.063727] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[ 3.073240] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[ 4.083424] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[ 4.092902] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[ 5.085423] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[ 5.085424] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
The trace repeated No.4 and No.5 forever. It never reached No.6.
That means the work function kept re-entering this block:
if (tsc_start == -1) {
hpet = is_hpet_enabled();
schedule_delayed_work(&tsc_irqwork, HZ);
tsc_start = tsc_read_refs(&ref_start, hpet);
return;
}
The real question became:
Why does tsc_start keep behaving like -1?
tsc_read_refs
The answer was in tsc_read_refs:
#define MAX_RETRIES 5
#define SMI_TRESHOLD 50000
static u64 tsc_read_refs(u64 *p, int hpet)
{
u64 t1, t2;
int i;
for (i = 0; i < MAX_RETRIES; i++) {
t1 = get_cycles();
if (hpet)
*p = hpet_readl(HPET_COUNTER) & 0xFFFFFFFF;
else
*p = acpi_pm_read_early();
t2 = get_cycles();
if ((t2 - t1) < SMI_TRESHOLD)
return t2;
}
return ULLONG_MAX;
}
tsc_start was not literally a normal -1 state. The function was returning ULLONG_MAX, which is the all-ones value. The code used that value as a failure sentinel.
The intended operation is:
- Read the TSC into
t1. - Read a reference counter, either HPET or ACPI PM.
- Read the TSC again into
t2. - If the read interval was short enough, accept
t2. - Otherwise, retry a few times and return failure.
The fixed threshold was the weak point. If the TSC count advances too much while the reference counter is being read, every retry fails. In a virtualized environment, that can happen because the guest-observed I/O path to an emulated reference counter can be slow relative to TSC.
Local cause
The local cause was:
The TSC/reference-counter sample was rejected every time.
The delayed calibration work kept rescheduling itself.
The final TSC clocksource registration path never completed.
No useful kernel message explained that this was happening.
That matched the original symptom exactly: tsc existed as hardware, but not as an available Linux clocksource.
Patch experiment 1: raise the threshold
The fastest experiment was crude:
-#define SMI_TRESHOLD 50000
+#define SMI_TRESHOLD 5000000
It was not a good patch, but it was a good experiment. The result:
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc acpi_pm
$ dmesg | grep -iE '(clocksource|tsc|No\\.)'
[ 0.000000] tsc: Detected 3000.000 MHz processor
[ 1.835560] TSC deadline timer enabled
[ 2.605330] Switched to clocksource kvm-clock
[ 3.114963] No.4, func: tsc_refine_calibration_work, line: 1248, file: arch/x86/kernel/tsc.c
[ 3.124533] No.5, func: tsc_refine_calibration_work, line: 1256, file: arch/x86/kernel/tsc.c
[ 4.219336] No.6, func: tsc_refine_calibration_work, line: 1266, file: arch/x86/kernel/tsc.c
[ 4.229233] No.7, func: tsc_refine_calibration_work, line: 1273, file: arch/x86/kernel/tsc.c
[ 4.239024] No.8, func: tsc_refine_calibration_work, line: 1278, file: arch/x86/kernel/tsc.c
[ 4.248844] No.9, func: tsc_refine_calibration_work, line: 1290, file: arch/x86/kernel/tsc.c
[ 4.258687] tsc: Refined TSC clocksource calibration: 3000.004 MHz
[ 4.264562] No.10, func: tsc_refine_calibration_work, line: 1300, file: arch/x86/kernel/tsc.c
The experiment proved the suspected branch. It did not prove that 5000000 was a safe value. A larger fixed threshold can hide real disturbances, and clocksource code is exactly the wrong place to hide real timing problems.
Patch experiment 2: trust KVM frequency information
The other experiment was KVM-specific. The KVM paravirtual clock path already knows a TSC frequency through pvclock data:
static unsigned long kvm_get_tsc_khz(void)
{
struct pvclock_vcpu_time_info *src;
int cpu;
unsigned long tsc_khz;
src = &hv_clock[cpu].pvti;
tsc_khz = pvclock_tsc_khz(src);
preempt_enable();
return tsc_khz;
}
void __init kvmclock_init(void)
{
x86_platform.calibrate_tsc = kvm_get_tsc_khz;
x86_platform.calibrate_cpu = kvm_get_tsc_khz;
}
The experiment was to force the TSC-reliable capability when KVM clock initialization provided the calibration path:
@@ -338,6 +338,7 @@ void __init kvmclock_init(void)
x86_platform.calibrate_tsc = kvm_get_tsc_khz;
x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
The test result:
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc acpi_pm
$ dmesg | grep -iE '(clocksource|tsc)'
[ 0.000000] tsc: Detected 3000.000 MHz processor
[ 1.832686] TSC deadline timer enabled
[ 1.929653] Skipped synchronization checks as TSC is reliable.
[ 2.598602] Switched to clocksource kvm-clock
Again, the experiment proved a direction. It did not automatically make the patch correct for upstream or for every virtualization setup.
Upstream had the better patches
Distribution kernels are often older than upstream. After the local cause was understood, the right move was to search upstream history.
The threshold idea had a much better upstream version:
x86/tsc: Make calibration refinement more robust
The important shape was:
-#define MAX_RETRIES 5
-#define SMI_TRESHOLD 50000
+#define MAX_RETRIES 5
+#define TSC_DEFAULT_THRESHOLD 0x20000
static u64 tsc_read_refs(u64 *p, int hpet)
{
u64 t1, t2;
+ u64 thresh = tsc_khz ? tsc_khz >> 5 : TSC_DEFAULT_THRESHOLD;
int i;
for (i = 0; i < MAX_RETRIES; i++) {
t1 = get_cycles();
...
t2 = get_cycles();
- if ((t2 - t1) < SMI_TRESHOLD)
+ if ((t2 - t1) < thresh)
return t2;
}
return ULLONG_MAX;
}
This is the difference between a local proof and a real fix. The threshold scales with the calibrated TSC frequency instead of pretending one fixed number fits every system.
There was also a KVM-related upstream direction:
kvmclock: fix TSC calibration for nested guests
The relevant shape:
static unsigned long kvm_get_tsc_khz(void)
{
src = &hv_clock[cpu].pvti;
tsc_khz = pvclock_tsc_khz(src);
put_cpu();
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
return tsc_khz;
}
The upstream capability name and exact conditions differ from the local experiment, which is exactly the point. Kernel patches carry architecture, migration, nested virtualization, CPU feature, and regression concerns that are easy to miss from one local reproduction.
What changed since the original note
The original note was centered on printk and source reading. That is still useful, but a current workflow should check less invasive tools first when the code path permits it.
Useful modern tools and interfaces include:
- dynamic debug for enabling
pr_debug()anddev_dbg()callsites at runtime, - ftrace and tracefs for function tracing and event tracing,
- boot-time tracing through bootconfig when the interesting path happens before normal user space,
- BPF tracing when a stable attachment point exists and the system policy allows it,
- and distribution debuginfo plus crash dumps when a failure produces a dumpable state.
For this specific old failure, printk was still a good fit because the suspect path ran early, the branch was simple, and the question was binary: does the delayed work get past this block?
Lessons
The durable lessons are not CentOS 7-specific:
- Compare a failing system with a similar healthy one before patching.
- Treat missing log output as evidence. Silent paths are often where the bug is.
- Instrument branches, not feelings.
- Separate a patch that proves a hypothesis from a patch that should ship.
- Search upstream after you understand the failure well enough to search for the right words.
- Do not trust one virtualized timing result as a universal timing rule.
The story ends in a humble place: the local patch experiments were useful, but upstream had the cleaner fixes. That is a good outcome. Kernel debugging is not only about inventing patches; it is also about learning enough to recognize the right patch when you find it.
References
- Clock sources, clock events, sched_clock() and delay timers
- Message logging with printk
- Dynamic debug
- ftrace - Function Tracer
- Linux Tracing Technologies
- x86/tsc: Make calibration refinement more robust
- kvmclock: fix TSC calibration for nested guests
See also
Related: Linux, virtualized time, system output, linux troubleshooting.