vmm.dev

virtualized time

Timekeeping in virtualized Linux

Time in a virtual machine is not one thing. User space sees clocks. The kernel maintains timekeeping state. Hardware or the hypervisor provides counters and wall-clock seeds. Debugging time drift or missing clocksources requires keeping those layers separate.

[Figure: three stacked layers. user space: POSIX clocks. kernel: timekeeper, jiffies, clocksource. hardware / hypervisor: TSC, pvclock, RTC, HPET.]

The three-layer model

For this note, it is useful to split time into three layers.

Layer | Examples | Question
user space | clock_gettime, date, timers | what does the application observe?
kernel | timekeeper, hrtimer, jiffies, clocksource | how does Linux maintain time?
hardware or hypervisor | RTC, TSC, HPET, ACPI PM timer, pvclock | what source feeds the kernel?

The real kernel has more machinery, but this model prevents a common mistake: treating RTC, clocksource, NTP, and application clocks as the same problem.

Initial wall-clock time

At boot, the kernel needs an initial wall-clock value. A real-time clock device can provide calendar-like data: year, month, day, hour, minute, second. Linux converts that into internal timekeeping state.

After boot, the RTC is not the primary mechanism for every timestamp. The system keeps time by measuring elapsed time from a counter and combining that delta with the initial wall-clock value.

time now = boot wall-clock seed + elapsed counter delta

This is why a broken RTC and a bad clocksource produce different symptoms.

Elapsed time is the center

The important idea is that the operating system does not need to ask the RTC for the current calendar time on every timestamp. Once Linux has an initial wall-clock seed, it mainly needs elapsed time.

If a counter increases at a known rate, Linux can read the counter later and calculate the delta:

counter at boot = 20
counter later   = 100
delta           = 80 ticks

With the counter frequency, that becomes elapsed time. With elapsed time and the initial wall-clock seed, Linux can maintain wall-clock time, monotonic time, scheduler timestamps, and timer expirations.
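The arithmetic above can be sketched in a few lines. This is an illustration of the idea, not kernel code: the 1 MHz counter frequency and the boot seed value are assumptions picked so the numbers are easy to check, and the function names are hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000


def ticks_to_ns(delta_ticks: int, freq_hz: int) -> int:
    """Convert a counter delta to nanoseconds for a counter running at freq_hz."""
    return delta_ticks * NSEC_PER_SEC // freq_hz


def wall_clock_ns(boot_seed_ns: int, counter_at_boot: int,
                  counter_now: int, freq_hz: int) -> int:
    """time now = boot wall-clock seed + elapsed counter delta."""
    delta = counter_now - counter_at_boot
    return boot_seed_ns + ticks_to_ns(delta, freq_hz)


if __name__ == "__main__":
    # Counter values from the text: 20 at boot, 100 later, so delta = 80 ticks.
    # Assumed for illustration only: a 1 MHz counter and a made-up boot seed.
    freq_hz = 1_000_000
    boot_seed_ns = 1_700_000_000 * NSEC_PER_SEC
    print(ticks_to_ns(100 - 20, freq_hz))  # 80000
    print(wall_clock_ns(boot_seed_ns, 20, 100, freq_hz))
```

Integer arithmetic is used on purpose. Floating-point conversion would lose precision once the nanosecond values grow large.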

That is why clocksource quality matters so much. A clocksource is not a decorative device. It is the foundation under the timekeeping state that user space eventually observes.

Clocksource

Linux abstracts readable counters through the clocksource framework. A good clocksource should be monotonic, stable, fast to read, of high enough resolution, and not too expensive under virtualization.

[Figure: PIT (slow), HPET (emulated), TSC (fast), pvclock (guest aware). Linux abstracts these as clocksources.]

Classic PC timers include PIT, HPET, and the ACPI PM timer. They are useful as fallback or compatibility devices, but they are often poor primary clocksources in virtual machines because reads may involve emulation.

What a good counter should provide

The original note listed the properties in plain terms. A useful clocksource should:

  • not move backward under normal operation,
  • not stop unexpectedly,
  • not jump unless the kernel knows how to account for it,
  • have a stable or known frequency,
  • have enough resolution for the workload,
  • be cheap to read,
  • work across CPUs or be constrained so Linux can use it safely,
  • and, ideally, support fast user-space reads through vDSO when appropriate.

No physical or virtual device is perfect. Linux therefore rates, validates, and sometimes rejects clocksources. The tsc can be excellent, but only if Linux trusts its frequency and stability in that environment.
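Several of the properties in the list can be checked mechanically if a counter is sampled alongside a trusted reference. The sketch below is a user-space illustration of that idea, loosely in the spirit of comparing one time source against another. It is not the kernel's validation algorithm, and the function name, sample format, and 1% tolerance are all assumptions.

```python
from typing import List, Tuple


def check_counter_samples(samples: List[Tuple[int, int]], freq_hz: int,
                          tolerance: float = 0.01) -> List[str]:
    """Flag suspicious steps in a list of (reference_ns, counter) samples.

    Each sample pairs a reference timestamp in nanoseconds with a raw
    counter reading. The counter is expected to advance at freq_hz.
    """
    problems = []
    for (ref_prev, cnt_prev), (ref_cur, cnt_cur) in zip(samples, samples[1:]):
        ref_delta_ns = ref_cur - ref_prev
        cnt_delta = cnt_cur - cnt_prev
        if cnt_delta < 0:
            problems.append(f"counter moved backward at ref={ref_cur}")
            continue
        if cnt_delta == 0 and ref_delta_ns > 0:
            problems.append(f"counter stopped at ref={ref_cur}")
            continue
        # Ticks the counter should have advanced over the reference interval.
        expected = ref_delta_ns * freq_hz / 1_000_000_000
        if expected > 0 and abs(cnt_delta - expected) > tolerance * expected:
            problems.append(f"counter rate off at ref={ref_cur}")
    return problems
```

A jump and a frequency error look the same to this check: both show up as a counter delta that disagrees with the reference interval.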

Legacy devices in virtual machines

PC-compatible systems have long exposed devices such as:

Device | Role | VM concern
PIT | old programmable interval timer | low resolution and usually emulated
HPET | high precision event timer | can be expensive when emulated
ACPI PM timer | platform timer | read path may be slow in a guest
RTC | wall-clock seed and persistent clock | lifecycle semantics depend on platform

These devices are still useful for compatibility, fallback, and calibration. They are usually not what you want as the primary clocksource in a modern VM.

The reason is cost. A guest read of an emulated legacy device can require a VM exit or a hypervisor-assisted path. A time read is a common operation, so that cost matters.
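One rough way to see that cost from inside a guest is to time a large number of clock reads. The sketch below uses Python's standard time module. The function name and iteration count are assumptions, and interpreter overhead dominates the absolute number, so the result is only meaningful as a relative comparison on the same machine.

```python
import time


def average_read_cost_ns(clock_id: int, iterations: int = 100_000) -> float:
    """Return the average cost in nanoseconds of one clock_gettime_ns call."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        time.clock_gettime_ns(clock_id)
    end = time.perf_counter_ns()
    return (end - start) / iterations


if __name__ == "__main__":
    cost = average_read_cost_ns(time.CLOCK_MONOTONIC)
    # The figure includes Python call overhead, not just the clock read.
    print(f"CLOCK_MONOTONIC: about {cost:.0f} ns per read")
```

If the same measurement is several times slower on one guest than on a comparable one, the selected clocksource is a reasonable first thing to compare.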

TSC

The Time Stamp Counter is a CPU counter. On modern systems it is often the best clocksource because it is extremely fast to read. The catch is stability. Linux must know whether the TSC is invariant, synchronized enough across CPUs, and usable under the hypervisor's behavior.

A TSC that changes frequency, stops in deep idle states, jumps during migration, or differs between CPUs is dangerous as a primary clocksource. Modern hardware and hypervisors generally handle this much better than older systems, but the kernel still has validation paths for a reason.

Historically, TSC handling was one of the places where virtualization details leaked into the guest. Migration, CPU frequency behavior, nested virtualization, and the way the hypervisor presents CPU features can all affect whether Linux trusts it.

When the TSC is usable, it is often the best answer. When it is not provably usable, Linux may choose a paravirtual clocksource instead.

pvclock and paravirtual clocks

Virtualized guests need help. KVM and Xen can expose paravirtual clock information so the guest does not treat every time read as an emulated legacy-device access. In Linux, KVM guests commonly use kvm-clock; Xen guests use Xen clock support.

The point is not that pvclock is magical. The point is that the hypervisor and guest cooperate so Linux can compute time cheaply and consistently.

The paravirtual interface can provide data that lets the guest convert a counter value into nanoseconds without repeatedly reading slow legacy devices. This is why kvm-clock or Xen clock support can be a better primary clocksource than HPET or ACPI PM in a guest.
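The conversion can be sketched as follows. The field names follow the pvclock time record as documented for KVM (a TSC snapshot, the guest time at that snapshot, a 32-bit fixed-point multiplier, and a shift), but this is a simplified model rather than the guest's real read path: the class and function names are hypothetical, and the real protocol also re-reads the version after the computation to detect a concurrent update.

```python
from dataclasses import dataclass


@dataclass
class PvclockTimeInfo:
    """Simplified model of the per-vCPU time record a hypervisor publishes."""
    version: int            # odd while the hypervisor is updating the record
    tsc_timestamp: int      # TSC value when system_time was sampled
    system_time: int        # guest time in nanoseconds at tsc_timestamp
    tsc_to_system_mul: int  # 32-bit fixed-point multiplier (fraction of 2**32)
    tsc_shift: int          # signed shift applied to the delta before scaling


def scale_delta(delta: int, mul: int, shift: int) -> int:
    """Scale a TSC delta to nanoseconds using shift and fixed-point multiply."""
    if shift >= 0:
        delta <<= shift
    else:
        delta >>= -shift
    return (delta * mul) >> 32


def pvclock_read_ns(info: PvclockTimeInfo, tsc_now: int) -> int:
    """Compute guest time in nanoseconds from the record and a TSC reading."""
    if info.version & 1:
        raise RuntimeError("record is being updated; retry the read")
    delta = tsc_now - info.tsc_timestamp
    return info.system_time + scale_delta(
        delta, info.tsc_to_system_mul, info.tsc_shift)
```

For a 2 GHz TSC, each tick is 0.5 ns, which corresponds to a multiplier of 2**31 with a shift of 0. A delta of 2000 ticks then scales to 1000 ns, with no device access involved.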

The relationship between pvclock and TSC is close. pvclock often uses TSC-derived data with hypervisor-provided conversion information. That is why the missing-tsc story in the kernel patch note eventually led into KVM clock and TSC calibration code.

Practical expectations

Environment | Usually good | Usually poor
KVM guest | kvm-clock, validated TSC | HPET or ACPI PM as primary source
Xen guest | Xen clocksource, validated TSC on suitable hardware | emulated legacy timers
bare metal modern x86 | invariant TSC | PIT as primary source

Always check the actual machine. Hypervisor version, CPU features, migration settings, and kernel version matter.

The earlier version had more opinionated tables. They are useful as intuition, but they should not be read as universal rules for every current cloud or hypervisor release.

Older Xen-like environment

Source | Read/write shape | vDSO-friendly | Practical note
paravirtual clock | read-oriented | often no | good baseline for old guests
TSC | read and sometimes virtualized write context | yes when trusted | risky on older hardware or migration paths
ACPI PM | read-only | no | fallback, not preferred
HPET | read-only | no | fallback, not preferred
jiffies | kernel tick state | no | not a precision time source

The warning in the original note was mostly about older hardware and migration behavior. If a guest can move between hosts where TSC behavior is not consistent, blindly trusting TSC is unsafe.

Newer Xen-like environment

Source | Read/write shape | vDSO-friendly | Practical note
paravirtual clock | read-oriented | environment-dependent | still useful
validated TSC | read path | yes | can be excellent when invariant and synchronized
ACPI PM | read-only | no | fallback
HPET | read-only | no | fallback
jiffies | kernel tick state | no | not a precision time source

The shift is that newer CPU and hypervisor support can make TSC much more attractive.

KVM-like environment

Source | Read/write shape | vDSO-friendly | Practical note
kvm-clock | paravirtual clocksource | yes on common setups | strong default in many guests
validated TSC | fast counter | yes | strong when Linux trusts it
ACPI PM | read-only | no | fallback and calibration reference
HPET | read-only | no | fallback and calibration reference
jiffies | kernel tick state | no | not a primary precision source

In practice, the selected clocksource is a kernel decision. Do not force a source because a table says it is theoretically faster. First read what Linux selected and why.

Checking Linux state

The selected and available clocksources are exposed in sysfs:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
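The same two files can be read from a script, which is convenient when the state needs to be captured alongside other diagnostics. A minimal sketch, assuming the sysfs paths shown above; the function name and the shape of the returned dictionary are hypothetical.

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource_state(base: Path = CLOCKSOURCE_DIR) -> dict:
    """Read the selected and available clocksources from sysfs."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return {
        "current": current,
        "available": available,
        # Convenience flag for the missing-tsc question.
        "tsc_available": "tsc" in available,
    }


if __name__ == "__main__":
    try:
        print(read_clocksource_state())
    except OSError as exc:
        # Not a Linux system, or sysfs is not mounted at the usual place.
        print(f"could not read clocksource state: {exc}")
```

The base directory is a parameter so the function can be pointed at a copied sysfs tree when examining output collected from another machine.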

CPU flags can explain what Linux believed about the processor:

grep -m1 '^flags' /proc/cpuinfo
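The flags line is long, so it can help to pull out only the entries related to the TSC. The sketch below assumes an x86 /proc/cpuinfo and checks a small set of flag names that x86 Linux kernels report there. The selection of flags and the function name are choices made for this illustration, not a complete list of what the kernel consults.

```python
from pathlib import Path

# x86 flag names as they appear in /proc/cpuinfo; an illustrative subset.
TSC_FLAGS = ("tsc", "constant_tsc", "nonstop_tsc", "tsc_reliable",
             "tsc_known_freq", "hypervisor")


def tsc_related_flags(cpuinfo_text: str) -> dict:
    """Report which TSC-related flags appear on the first flags line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, rest = line.partition(":")
            flags = set(rest.split())
            return {name: name in flags for name in TSC_FLAGS}
    # No flags line found, for example on a non-x86 system.
    return {name: False for name in TSC_FLAGS}


if __name__ == "__main__":
    try:
        print(tsc_related_flags(Path("/proc/cpuinfo").read_text()))
    except OSError as exc:
        print(f"could not read /proc/cpuinfo: {exc}")
```

A guest that reports hypervisor but not constant_tsc or nonstop_tsc is a hint that the hypervisor is not presenting the TSC as invariant, which is worth checking against the boot log.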

Boot logs often explain clocksource decisions:

dmesg -T | grep -iE 'clocksource|tsc|kvm-clock|xen|hpet|acpi_pm'

User-space time state is visible through:

timedatectl status

If the problem is application-visible time, also check which clock the application uses. CLOCK_REALTIME, CLOCK_MONOTONIC, timerfd, scheduler time, and file timestamps do not all answer the same question.
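To see that these clocks really are different timelines, they can be sampled side by side. A minimal sketch using Python's standard time module; the function name is hypothetical, and CLOCK_BOOTTIME is included only where the platform provides it.

```python
import time

CLOCKS = {
    "CLOCK_REALTIME": time.CLOCK_REALTIME,
    "CLOCK_MONOTONIC": time.CLOCK_MONOTONIC,
}
# CLOCK_BOOTTIME is Linux-specific, so only add it when it exists.
if hasattr(time, "CLOCK_BOOTTIME"):
    CLOCKS["CLOCK_BOOTTIME"] = time.CLOCK_BOOTTIME


def snapshot_clocks() -> dict:
    """Read each clock once and return the values in nanoseconds."""
    return {name: time.clock_gettime_ns(cid) for name, cid in CLOCKS.items()}


if __name__ == "__main__":
    for name, value in snapshot_clocks().items():
        print(f"{name:16} {value}")
```

CLOCK_REALTIME can be stepped by an administrator or by time synchronization. CLOCK_MONOTONIC does not go backward. CLOCK_BOOTTIME additionally counts time spent suspended. An application that measures intervals with the realtime clock will see every correction as a jump in its own measurements.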

RTC is a different question

The RTC is mostly about the wall-clock seed and about time across power transitions. In a VM, RTC persistence depends on the hypervisor and lifecycle operation. Reboot, shutdown, suspend, stop/start, and migration can behave differently.

Hypervisor case | Guest can read/write RTC | Reboot behavior | Stop/start behavior
KVM-style local VM | yes | usually persistent | depends on management layer
Xen-style guest | yes | depends on domain lifecycle | depends on toolstack
cloud VM | abstracted by platform | provider-specific | provider-specific

Treat RTC observations as platform evidence, not universal law.

The original note emphasized a useful operational point: if the guest time is wrong after boot, do not blame only the RTC. After boot, the ongoing drift is more often about the selected clocksource, host/guest synchronization, NTP or chrony state, suspend/resume, or migration behavior.

NTP and correction

Clocksource and NTP solve different layers.

The clocksource gives Linux a local timeline. NTP or chrony compares the system to external time and adjusts it. If the local counter is unstable, NTP may continuously correct symptoms without removing the underlying source of drift.
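Drift is usually expressed in parts per million, which makes it easy to relate an observed offset to how long it took to accumulate. The arithmetic is simple enough to sketch. The function names are hypothetical and the example values are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def drift_ppm(offset_seconds: float, interval_seconds: float) -> float:
    """Drift rate in parts per million for an offset gained over an interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return offset_seconds / interval_seconds * 1_000_000


def seconds_per_day(ppm: float) -> float:
    """Offset accumulated over one day at a constant drift rate."""
    return ppm * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    # Made-up example: a guest that gains half a second every hour.
    ppm = drift_ppm(0.5, 3600)
    print(f"{ppm:.1f} ppm, about {seconds_per_day(ppm):.1f} s per day")
```

A rate that stays small and steady is the kind of error synchronization is designed to absorb. A rate that is large, or that changes after migration or resume, points back at the clocksource layer.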

Useful checks:

timedatectl status
chronyc tracking
chronyc sources -v

For incident timelines, this matters because a large correction can make log ordering confusing. When time is part of the incident, record both the synchronization state and the kernel clocksource state.
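One way to keep ordering recoverable is to stamp each event with both a wall-clock and a monotonic reading. The sketch below is an assumption about how an application might do that, not a description of any particular logging system, and all names in it are hypothetical. Monotonic values are only comparable within one boot of one machine.

```python
import time
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    message: str
    realtime_ns: int   # wall-clock time; may jump when the clock is corrected
    monotonic_ns: int  # never goes backward within one boot


def make_event(message: str) -> Event:
    """Stamp a message with both clocks at the moment it is created."""
    return Event(message, time.time_ns(), time.monotonic_ns())


def order_events(events: List[Event]) -> List[Event]:
    """Order events by the monotonic stamp, ignoring wall-clock corrections."""
    return sorted(events, key=lambda e: e.monotonic_ns)


if __name__ == "__main__":
    # Simulated backward wall-clock step between two events.
    first = Event("request received", realtime_ns=1_000, monotonic_ns=10)
    second = Event("request handled", realtime_ns=500, monotonic_ns=20)
    for event in order_events([second, first]):
        print(event.message)
```

Sorting the same two events by realtime_ns would put the handling before the receipt, which is exactly the confusion a large correction introduces.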

What can go wrong

Common failure shapes:

Symptom | Possible layer
wrong time immediately after boot | RTC seed, platform lifecycle, image config
steady drift after boot | clocksource stability, NTP state, host issue
time jumps after resume or migration | hypervisor lifecycle, TSC scaling, synchronization
high cost in time reads | emulated legacy clocksource
tsc missing from available clocksources | kernel validation, CPU flags, hypervisor presentation
logs appear out of order | clock adjustment, different clocks, delayed logging

The missing-tsc case in the kernel patch note belongs to the validation row. Linux had to decide whether a counter was trustworthy enough to expose.

Debugging checklist

When time appears wrong, ask:

  1. Is the application using realtime, monotonic, or another clock?
  2. What clocksource did Linux select?
  3. What clocksources were available?
  4. Did boot logs mark TSC unstable?
  5. Is NTP correcting a boot-time error after the fact?
  6. Did the VM migrate or resume?
  7. Is the hypervisor exposing paravirtual clock features?

The kernel patch note is one concrete example: a KVM guest did not list tsc as an available clocksource, so the investigation had to move from sysfs output into kernel validation logic.

See also

Related: Linux, kernel patch, linux troubleshooting.