Bug #13354

illumos should calibrate the TSC earlier in the boot process.

Added by Jason King 4 months ago. Updated 4 months ago.

Since time immemorial, illumos has used the i8253/8254 PIT timer to do three key tasks during boot:
  • Experimentally derive a sufficiently large value (microdata) on the system that will cause a minimum ten microsecond delay when calling tenmicrosec(). This is essentially to bridge the gap for drivers between when startup_modules() starts loading drivers and when the TSC (timestamp counter) is calibrated (and thus able to be used) when the PSM is loaded.
  • Measure the speed of the Local APIC clock to calibrate the Local APIC timer.
  • Measure the frequency of the timestamp counter (TSC) for use in gethrtime(9F). This allows the TSC to be used (among other things) in tenmicrosec() instead of a busy loop.

The latter two tasks are done rather late in the boot process, as a part of the PSM (platform-specific module) initialization. This arrangement is unnecessary and appears to come from a time when Solaris ran on X86 chips that did not include a TSC. The TSC was first introduced in the original Pentium processor, and it's unlikely that illumos has ever been able to run on a system that did not include it (even prior to removing support for 32-bit CPUs). This arrangement also complicates supporting alternatives to the PIT timer for calibration, since each new calibration source would need to support performing all of those tasks.

However, if we perform the TSC calibration early in the boot process (prior to startup_modules() being called), we then no longer need to derive a value for microdata (with a very small additional bit of work, we can eliminate the need for the variable and the whole microfind.c file completely). Additionally, the Local APIC can then be calibrated with the TSC instead of the PIT timer (any X86 system illumos supports will have a TSC frequency on the order of 1000x higher than the PIT, so it's more than good enough for this). The later TSC calibration during PSM initialization is then no longer necessary.
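To illustrate why a calibrated TSC removes the need for microdata, here is a minimal sketch of a TSC-based ten-microsecond delay. It is only a model in Python, not the kernel code: `tsc_ticks_for_delay`, `spin_delay`, and the `read_tsc` callable are hypothetical names, and the TSC frequency is assumed to be already known.

```python
def tsc_ticks_for_delay(tsc_hz, usec):
    """Number of TSC ticks spanning at least `usec` microseconds."""
    # Ceiling division, so the delay is never shorter than requested.
    return -(-tsc_hz * usec // 1_000_000)


def spin_delay(read_tsc, tsc_hz, usec=10):
    """Busy-wait until at least `usec` microseconds' worth of ticks elapse.

    `read_tsc` is a hypothetical callable returning the current counter.
    """
    start = read_tsc()
    target = tsc_ticks_for_delay(tsc_hz, usec)
    while read_tsc() - start < target:
        pass
    return target
```

With a known frequency the loop bound is computed directly, rather than derived experimentally at boot.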

With this setup in place, the PIT timer can easily become just one of several possible methods for measuring the TSC frequency. Any alternative methods will only need to support reporting the TSC frequency -- either through measurement, or reporting the known value (e.g. running as a hypervisor guest on hypervisors that report the TSC frequency to guests).
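A rough sketch of what "one of several possible methods" could look like: each source only has to report a TSC frequency, and the first one that can do so wins. This is a Python model under assumptions, not the proposed implementation. The names `calibrate_tsc` and `tsc_hz_from_pit` and the source ordering are hypothetical; 1,193,182 Hz is the nominal PIT input clock.

```python
PIT_HZ = 1_193_182  # nominal i8253/8254 input clock, in Hz


def tsc_hz_from_pit(tsc_delta, pit_delta):
    """TSC frequency from tick counts taken over the same interval."""
    return tsc_delta * PIT_HZ // pit_delta


def calibrate_tsc(sources):
    """Return (name, hz) from the first source that reports a frequency.

    `sources` is a list of (name, probe) pairs; a probe returns the TSC
    frequency in Hz, or None if that source is unavailable.
    """
    for name, probe in sources:
        hz = probe()
        if hz:
            return name, hz
    raise RuntimeError("no usable TSC calibration source")


# Hypothetical ordering: prefer a hypervisor-reported value, else measure.
sources = [
    ("hypervisor", lambda: None),  # not running as a guest in this example
    ("pit", lambda: tsc_hz_from_pit(30_000_000, 11_932)),
]
print(calibrate_tsc(sources))
```

Adding a new source then means adding one probe to the list, not reimplementing all three boot tasks.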


Updated by Electric Monk 4 months ago

  • Gerrit CR set to 1097

Updated by Patrick Mooney 4 months ago

> Additionally, the Local APIC can then be calibrated with the TSC instead of the PIT timer (any X86 system illumos supports will have a TSC frequency on the order of 1000x higher than the PIT, so it's more than good enough for this).

I'm not sure the precision of the TSC is an absolute win when it comes to calibrating the frequency of the APIC timer. If we calibrate the TSC against the PIT, there is going to be some amount of error in that measurement. If the resulting TSC frequency is used to then calibrate the APIC clock, the error would be compounded.

Ultimately, there will still be situations where calibrating the APIC from the TSC is advantageous. If the PIT is absent, or we're inside a VM which has given us a precise (and presumed accurate) value for the TSC frequency, then the TSC is a good choice. Using it as the baseline will prevent a giant explosion of source-target calibration pairs.


Updated by Jason King 4 months ago

To provide some data points, comparing the results of calibrating the APIC with the TSC vs. the PIT (looking at the result of apic_calibrate_tsc vs apic_calibrate_pit), on my home server, the worst difference observed between the TSC and the PIT values is 0.006%. The PIT calibration also shows a bit more variability in its results than the TSC:

52430 52428 2
52430 52428 2
52428 52428 0
52427 52428 1
52426 52428 2
52429 52427 2
52426 52427 1
52429 52427 2
52426 52427 1
52430 52427 3

Looking at a bhyve VM running OmniOS w/ the change:

Idle Host
70363 70361 2
70365 70361 4
70362 70361 1
70366 70361 5
70365 70361 4
Busy Host
70380 70311 69
70389 70336 53
70376 70399 23
70373 70318 55
70384 70316 68

Even there, the worst difference was around 0.1% when booting the VM while the host was busy (starting this VM + about 10 other instances). When idle, the difference was obviously far less (about 0.007%).
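For anyone wanting to recheck the percentages, the relative differences can be recomputed from the first two columns of the tables above. This is just a small sketch; taking the difference relative to the mean of the two values is my own choice of convention, and `worst_relative_diff` is a made-up helper name.

```python
def worst_relative_diff(rows):
    """Largest |a - b| relative to the mean of a and b, as a percentage."""
    return max(abs(a - b) / ((a + b) / 2) * 100 for a, b in rows)


# First two columns of each table, copied from the measurements above.
bare_metal = [
    (52430, 52428), (52430, 52428), (52428, 52428), (52427, 52428),
    (52426, 52428), (52429, 52427), (52426, 52427), (52429, 52427),
    (52426, 52427), (52430, 52427),
]
vm_idle = [
    (70363, 70361), (70365, 70361), (70362, 70361),
    (70366, 70361), (70365, 70361),
]
vm_busy = [
    (70380, 70311), (70389, 70336), (70376, 70399),
    (70373, 70318), (70384, 70316),
]

for label, rows in [("bare metal", bare_metal),
                    ("VM, idle host", vm_idle),
                    ("VM, busy host", vm_busy)]:
    print(f"{label}: worst difference {worst_relative_diff(rows):.4f}%")
```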

I've not seen any analysis of the precision of the PIT timer itself, so doing a formal analysis is more difficult. The methods I'm familiar with assume a normal distribution surrounding measurement/error. I'm not sure, given the discrete nature of things, how applicable those would be here. If we can in fact assume a normal distribution, the calculation I believe would be something along these lines (due to the lack of mathematical markup support, I'm going to try to break this up a bit to hopefully make it easier to follow).

  • apic_diff = apic_tick_end - apic_tick_start
  • pit_diff = pit_tick_end - pit_tick_start (I believe it actually counts down, but we only care about magnitudes for the uncertainty)
  • tsc_diff = tsc_tick_end - tsc_tick_start

If we assume the uncertainty of each reading is +/- 1 {apic,tsc,pit} tick, then the uncertainties of the differences would be sqrt(2) * tick frequency. For the PIT, that's approximately 1.4 MHz; for the TSC, it's going to be ~1.4 * TSC frequency (so something on the order of GHz).

The APIC frequency calculation is then:
  • CONSTANT * apic_diff / pit_diff (using PIT)
  • CONSTANT * tsc_freq * apic_diff / tsc_diff (using TSC)
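The two calculations can be sketched as follows. I'm assuming here that CONSTANT amounts to a units conversion, so with the PIT it reduces to the nominal PIT input clock (1,193,182 Hz), and with the TSC the scaling is carried by tsc_freq. The function names are hypothetical.

```python
PIT_HZ = 1_193_182  # nominal i8253/8254 input clock, in Hz


def apic_hz_from_pit(apic_diff, pit_diff):
    """APIC timer frequency, using PIT ticks as the time base."""
    return PIT_HZ * apic_diff / pit_diff


def apic_hz_from_tsc(apic_diff, tsc_diff, tsc_hz):
    """APIC timer frequency, using an already-calibrated TSC as the time base."""
    return tsc_hz * apic_diff / tsc_diff


# Hypothetical readings over a ~10 ms window.
print(apic_hz_from_pit(1_000_000, 11_932))
print(apic_hz_from_tsc(1_000_000, 30_000_000, 3_000_000_000))
```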

Using the PIT, the error is measured_apic_freq * sqrt[ (apic_diff_uncertainty / apic_diff)^2 + (pit_diff_uncertainty / pit_diff)^2 ].
Using the TSC, the error is measured_apic_freq * sqrt[ (apic_diff_uncertainty / apic_diff)^2 + (tsc_freq_uncertainty / tsc_freq)^2 + (tsc_diff_uncertainty / tsc_diff)^2 ].

Since the TSC frequency is measured with the PIT, the uncertainty for the TSC frequency is going to be the same as the PIT difference uncertainty (you need to do a bit of unit conversion, but essentially even with the larger values for the TSC timestamp differences, the uncertainty in the difference of values is still going to translate to ~ +/- 1.4 MHz for the TSC frequency).

Note that then the only difference is the addition of the uncertainty of the TSC difference measurement. If we can assume the TSC difference also has an uncertainty of +/- 1 TSC tick, that's at least 1000x less. The actual values are going to vary based on the actual numbers measured, but basically it's adding less than 1/1000th the amount of error to the result. This does seem to agree with the above numbers measured on raw iron (though again, this is all dependent on the assumption that the errors on the measured values follow a normal distribution). The VM numbers of course are going to be influenced by how busy the host is (since the PIT is emulated), so the higher variability there would be expected, though even there it's not too bad. Of course it would be possible for someone to create a pathological case where the results are worse, but that seems like it would just make both the TSC frequency calculation and APIC frequency calculation suffer equally.
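As a rough numeric check of the reasoning above, here is a sketch that plugs hypothetical tick counts into the standard propagation-of-uncertainty formula for products and quotients. The ~10 ms window, the 3 GHz TSC, and the 100 MHz APIC timer are made-up values for illustration, and the function names are mine. It also assumes independent, normally distributed errors, which is the assumption in question.

```python
import math

SIGMA = math.sqrt(2)  # uncertainty of a difference of two +/- 1 tick readings


def apic_rel_err_pit(apic_diff, pit_diff):
    """Relative uncertainty of the APIC frequency measured against the PIT."""
    return math.hypot(SIGMA / apic_diff, SIGMA / pit_diff)


def tsc_freq_rel_err(tsc_diff, pit_diff):
    """Relative uncertainty of a TSC frequency measured against the PIT."""
    return math.hypot(SIGMA / tsc_diff, SIGMA / pit_diff)


def apic_rel_err_tsc(apic_diff, tsc_diff, tsc_freq_rel):
    """Relative uncertainty of the APIC frequency measured against the TSC."""
    return math.sqrt((SIGMA / apic_diff) ** 2
                     + tsc_freq_rel ** 2
                     + (SIGMA / tsc_diff) ** 2)


# Hypothetical tick counts for a ~10 ms measurement window.
pit_diff, tsc_diff, apic_diff = 11_932, 30_000_000, 1_000_000

via_pit = apic_rel_err_pit(apic_diff, pit_diff)
via_tsc = apic_rel_err_tsc(apic_diff, tsc_diff,
                           tsc_freq_rel_err(tsc_diff, pit_diff))
print(f"via PIT: {via_pit:.6e}")
print(f"via TSC: {via_tsc:.6e}")
print(f"extra relative error from the TSC path: {(via_tsc - via_pit) / via_pit:.3e}")
```

Under these assumptions the PIT term dominates both results, so going through the TSC adds only a tiny fraction on top.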


Updated by Jason King 4 months ago

Also, if someone does have some info on this that would allow for better calculation of the error, please chime in -- it's not an area I've researched -- I'd be genuinely curious to know if/how the discrete nature might impact such calculations.


Updated by Joshua M. Clulow 4 months ago

In your example measurements, the numbers for the VM are higher than for the hardware. Is the VM from which the second table of readings originated running on the host from which the first table was taken?


Updated by Jason King 4 months ago

Yes, though the difference appears to be independent of this change. The existing APIC calibration code does 5 measurements and saves the median value into the global apic_ticks_per_SFnsecs (see apic_timer_init()). Looking at the same system (and same VM) running unmodified kernels (e.g. plain SmartOS as host, plain OmniOS CE as guest), the same difference is seen:

Host (SmartOS):

[root@bob ~]# mdb -ke 'apic_ticks_per_SFnsecs/E'
apic_ticks_per_SFnsecs:         52428

Guest (OmniOS):

root@pi:~#  mdb -ke 'apic_ticks_per_SFnsecs/E'
apic_ticks_per_SFnsecs:         70369
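For reference, the median-of-five approach mentioned above can be modelled in a few lines. This is only a sketch: `calibrate_median` and the `measure` callable are hypothetical stand-ins for a single calibration pass, not the kernel's code.

```python
import statistics


def calibrate_median(measure, runs=5):
    """Run `measure` several times and keep the median result.

    Taking the median discards a single outlier in either direction.
    """
    return statistics.median(measure() for _ in range(runs))
```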
