Bug #13354

closed

illumos should calibrate the TSC earlier in the boot process.

Added by Jason King 11 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: kernel
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Since time immemorial, illumos has used the i8253/8254 PIT timer to do three key tasks during boot:
  • Experimentally derive a sufficiently large value (microdata) on the system that will cause a minimum ten microsecond delay when calling tenmicrosec(). This is essentially to bridge the gap for drivers between when startup_modules() starts loading drivers and when the TSC (timestamp counter) is calibrated (and thus able to be used) when the PSM is loaded.
  • Measure the speed of the Local APIC clock to calibrate the Local APIC timer.
  • Measure the frequency of the timestamp counter (TSC) for use in gethrtime(9F). This allows the TSC to be used (among other things) in tenmicrosec() instead of a busy loop.

The latter two tasks are done rather late in the boot process, as a part of the PSM (platform support module?) initialization. This arrangement is unnecessary and appears to come from a time when Solaris ran on X86 chips that did not include a TSC. The TSC was first introduced in the original Pentium processor, and it's unlikely that illumos has ever been able to run on a system that did not include it (even prior to removing support for 32-bit CPUs). This arrangement also complicates supporting alternatives to the PIT timer for calibration since each new calibration source would need to support performing all of those tasks.

However, if we perform the TSC calibration early in the boot process (prior to startup_modules() being called), we then no longer need to derive a value for microdata (with a very small additional bit of work, we can eliminate the need for the variable and the whole microfind.c file completely). Additionally, the Local APIC can then be calibrated with the TSC instead of the PIT timer (any X86 system illumos supports will have a TSC frequency on the order of 1000x higher than the PIT, so it's more than good enough for this). The later TSC calibration during PSM initialization is then no longer necessary.

With this setup in place, the PIT timer can easily become just one of several possible methods for measuring the TSC frequency. Any alternative methods will only need to support reporting the TSC frequency -- either through measurement, or reporting the known value (e.g. running as a hypervisor guest on hypervisors that report the TSC frequency to guests).
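
As a rough illustration of the measurement step described above, the sketch below derives a TSC frequency from tick counts taken over the same interval on the PIT and the TSC. The helper name and the sample counts are hypothetical and do not come from the illumos source; only the nominal PIT input clock of 1,193,182 Hz is a fixed hardware value.

```python
# Nominal i8254 PIT input clock in Hz (a hardware constant).
PIT_HZ = 1_193_182


def tsc_hz_from_pit(tsc_delta: int, pit_delta: int, pit_hz: int = PIT_HZ) -> int:
    """Estimate the TSC frequency from tick counts over one shared interval.

    Hypothetical helper: the elapsed time is pit_delta / pit_hz seconds, so
    the TSC frequency is tsc_delta divided by that elapsed time.
    """
    if pit_delta <= 0:
        raise ValueError("PIT did not advance; cannot calibrate")
    return tsc_delta * pit_hz // pit_delta


# Made-up sample: roughly 10 ms of PIT ticks on a TSC running near 3 GHz.
pit_ticks = 11_932
tsc_ticks = 30_000_000
estimate = tsc_hz_from_pit(tsc_ticks, pit_ticks)
print(f"estimated TSC frequency: {estimate / 1e9:.3f} GHz")
```

A source that already knows the frequency (for example, a hypervisor-reported value) would simply return that number instead of measuring it.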


Related issues

Related to illumos gate - Bug #13961: Add HPET as a TSC calibration source (Closed, Jason King)

Actions #1

Updated by Electric Monk 10 months ago

  • Gerrit CR set to 1097
Actions #2

Updated by Patrick Mooney 10 months ago

Additionally, the Local APIC can then be calibrated with the TSC instead of the PIT timer (any X86 system illumos supports will have a TSC frequency on the order of 1000x higher than the PIT, so it's more than good enough for this).

I'm not sure the precision of the TSC is an absolute win when it comes to calibrating the frequency of the APIC timer. If we calibrate the TSC against the PIT, there is going to be some amount of error in that measurement. If the resulting TSC frequency is used to then calibrate the APIC clock, the error would be compounded.

Ultimately, there will still be situations where calibrating the APIC from the TSC is advantageous. If the PIT is absent, or we're inside a VM which has given us a precise (and presumed accurate) value for the TSC frequency, then the TSC is a good choice. Using it as the baseline will prevent a giant explosion of source-target calibration pairs.

Actions #3

Updated by Jason King 10 months ago

To provide some data points, I compared the results of calibrating the APIC with the TSC vs. the PIT (looking at the result of apic_calibrate_tsc vs. apic_calibrate_pit). On my home server, the worst difference observed between the TSC and the PIT values is 0.006%. The PIT calibration also shows a bit more variability in its results than the TSC:

PIT    TSC    DIFF
52430  52428  2
52430  52428  2
52428  52428  0
52427  52428  1
52426  52428  2
52429  52427  2
52426  52427  1
52429  52427  2
52426  52427  1
52430  52427  3

Looking at a bhyve VM running OmniOS w/ the change:

PIT    TSC    DIFF
Idle Host
70363  70361  2
70365  70361  4
70362  70361  1
70366  70361  5
70365  70361  4
Busy Host
70380  70311  69
70389  70336  53
70376  70399  23
70373  70318  55
70384  70316  68

Even there, the worst difference was around 0.1% when booting the VM while the host was busy (starting this VM plus about 10 other instances). When the host was idle, the difference was far smaller (about 0.007%).
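
For anyone who wants to re-derive the percentages, the following sketch recomputes the worst relative difference for each set of readings. The values are copied from the tables above; the function name is just an illustrative choice, and the difference is expressed relative to the PIT reading.

```python
# (PIT, TSC) APIC calibration results copied from the tables above.
bare_metal = [(52430, 52428), (52430, 52428), (52428, 52428), (52427, 52428),
              (52426, 52428), (52429, 52427), (52426, 52427), (52429, 52427),
              (52426, 52427), (52430, 52427)]
vm_idle = [(70363, 70361), (70365, 70361), (70362, 70361),
           (70366, 70361), (70365, 70361)]
vm_busy = [(70380, 70311), (70389, 70336), (70376, 70399),
           (70373, 70318), (70384, 70316)]


def worst_percent_diff(samples):
    """Largest |PIT - TSC| relative to the PIT value, as a percentage."""
    return max(abs(pit - tsc) / pit * 100 for pit, tsc in samples)


for label, samples in (("bare metal", bare_metal),
                       ("VM, idle host", vm_idle),
                       ("VM, busy host", vm_busy)):
    print(f"{label}: {worst_percent_diff(samples):.4f}%")
```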

I've not seen any analysis of the precision of the PIT timer itself, so doing a formal analysis is more difficult. The methods I'm familiar with assume a normal distribution of the measurement error. I'm not sure, given the discrete nature of things, how applicable those would be here. If we can in fact assume a normal distribution, I believe the calculation would be something along these lines (due to the lack of mathematical markup support, I'm going to try to break this up a bit to hopefully make it easier to follow).

  • apic_diff = apic_tick_end - apic_tick_start
  • pit_diff = pit_tick_end - pit_tick_start (I believe it actually counts down, but we only care about magnitudes for the uncertainty)
  • tsc_diff = tsc_tick_end - tsc_tick_start

If we assume the uncertainty of each reading is +/- 1 {apic,tsc,pit} tick, then the uncertainties of the differences would be sqrt(2)*tick frequency. For the PIT, that's approximately 1.4 MHz; for the TSC, it's going to be ~1.4*TSC frequency (so something on the order of GHz).

The APIC frequency calculation is then:
  • CONSTANT * apic_diff / pit_diff (using PIT)
  • CONSTANT * tsc_freq * apic_diff / tsc_diff (using TSC)

The error here is measured_apic_freq * sqrt[ (apic_diff_uncertainty / apic_diff)^2 + (pit_diff_uncertainty / pit_diff)^2 ] using the PIT.
When using the TSC, the error would be measured_apic_freq * sqrt[ (apic_diff_uncertainty / apic_diff)^2 + (tsc_freq_uncertainty / tsc_freq)^2 + (tsc_diff_uncertainty / tsc_diff)^2 ]

Since the TSC frequency is measured with the PIT, the uncertainty for the TSC frequency is going to be the same as the PIT difference uncertainty (you need to do a bit of unit conversion, but essentially even with the larger values for the TSC timestamp differences, the uncertainty in the difference of values is still going to translate to ~ +/- 1.4 MHz for the TSC frequency).

Note that the only difference, then, is the addition of the uncertainty of the TSC difference measurement. If we can assume the TSC difference also has an uncertainty of +/- 1 TSC tick, that's at least 1000x less. The actual values are going to vary based on the actual numbers measured, but basically it's adding less than 1/1000th the amount of error to the result. This does seem to agree with the above numbers measured on raw iron (though again, this is all dependent on the assumption that the errors on the measured values follow a normal distribution). The VM numbers of course are going to be influenced by how busy the host is (since the PIT is emulated), so the higher variability there would be expected, though even there it's not too bad. Of course it would be possible for someone to create a pathological case where the results are worse, but that seems like it would just make both the TSC frequency calculation and the APIC frequency calculation suffer equally.
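
To put rough numbers on the argument above, here is a small sketch that combines relative errors in quadrature for both paths. It assumes, as the analysis does, independent and normally distributed errors, with each tick difference uncertain by sqrt(2) ticks. The tick counts are made up (roughly a 10 ms window) and the function names are hypothetical.

```python
import math

TICK_ERR = math.sqrt(2)  # assumed uncertainty of a difference of two +/- 1 tick readings


def rel_err_via_pit(apic_diff, pit_diff):
    """Relative error of an APIC frequency computed from apic_diff / pit_diff."""
    return math.sqrt((TICK_ERR / apic_diff) ** 2 + (TICK_ERR / pit_diff) ** 2)


def rel_err_via_tsc(apic_diff, tsc_diff, tsc_freq_rel_err):
    """Relative error of an APIC frequency computed from tsc_freq * apic_diff / tsc_diff."""
    return math.sqrt((TICK_ERR / apic_diff) ** 2
                     + tsc_freq_rel_err ** 2
                     + (TICK_ERR / tsc_diff) ** 2)


# Made-up tick counts for one measurement window.
apic_diff, pit_diff, tsc_diff = 500_000, 11_932, 30_000_000

# The TSC frequency was itself measured against the PIT, so it inherits
# the PIT's relative uncertainty.
tsc_freq_rel_err = TICK_ERR / pit_diff

via_pit = rel_err_via_pit(apic_diff, pit_diff)
via_tsc = rel_err_via_tsc(apic_diff, tsc_diff, tsc_freq_rel_err)
print(f"via PIT: {via_pit:.6e}")
print(f"via TSC: {via_tsc:.6e}")
print(f"relative increase: {(via_tsc - via_pit) / via_pit:.2e}")
```

With these assumed inputs the extra term from the TSC difference is tiny next to the PIT term, which is the point the analysis makes; different window lengths would change the exact figures.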

Actions #4

Updated by Jason King 10 months ago

Also, if someone does have some info on this that would allow for better calculation of the error, please chime in -- it's not an area I've researched -- I'd be genuinely curious to know if/how the discrete nature might impact such calculations.

Actions #5

Updated by Joshua M. Clulow 10 months ago

In your example measurements, the numbers for the VM are higher than for the hardware. Is the VM from which the second table of readings originated running on the host from which the first table was taken?

Actions #6

Updated by Jason King 10 months ago

Yes, though the difference appears to be independent of this change. The existing APIC calibration code does 5 measurements and saves the median value into the global apic_ticks_per_SFnsecs (see apic_timer_init()). Looking at the same system (and same VM) running unmodified kernels (e.g. plain SmartOS as host, plain OmniOS CE as guest), the same difference is seen:

Host (SmartOS):

[root@bob ~]# mdb -ke 'apic_ticks_per_SFnsecs/E'
apic_ticks_per_SFnsecs:
apic_ticks_per_SFnsecs:         52428

Guest (OmniOS):

root@pi:~# mdb -ke 'apic_ticks_per_SFnsecs/E'
apic_ticks_per_SFnsecs:
apic_ticks_per_SFnsecs:         70369
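
The "take five measurements and keep the median" approach mentioned above can be sketched as follows. This is only an illustration of the idea, not the code in apic_timer_init(); the function name and sample values are made up.

```python
def median_of_samples(samples):
    """Return the middle value of an odd-length list of measurements."""
    if len(samples) % 2 == 0:
        raise ValueError("expected an odd number of samples")
    return sorted(samples)[len(samples) // 2]


# Made-up readings: one run is badly disturbed, but the median ignores it.
readings = [52430, 52428, 60000, 52427, 52429]
print(median_of_samples(readings))
```

Using the median rather than the mean keeps a single disturbed measurement from skewing the saved value.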
Actions #7

Updated by Jason King 3 months ago

  • Related to Bug #13961: Add HPET as a TSC calibration source added
Actions #8

Updated by Jason King 3 months ago

To test the source override mechanism, I needed a second calibration source, so I built a BE with both this change and the (soon to be under review) code to support using the HPET.

I then booted the BE without any other changes, and looked at tsc_calibration_source to verify the HPET was used to calibrate the TSC by default (the HPET support sets the preference higher than the PIT, so this was expected).
Then I added set tsc_calibration="pit" to /etc/system and rebooted the BE, and repeated the check to verify the PIT was used.
Then I removed the /etc/system line, rebooted and from the loader prompt, used set tsc_calibration=pit, booted and verified that again the PIT was used.
Then I added set tsc_calibration="hpet" to /etc/system, rebooted, and at the loader prompt, once again entered set tsc_calibration=pit to verify that loader options took precedence over /etc/system settings.

Actions #9

Updated by Jason King 3 months ago

Since there have been some adjustments over the course of the review, here is a summary to hopefully pull things together:

This change moves the calibration of the TSC to early in the boot process (after the VM system has started/initialized, but prior to the loading of any modules). This eliminates the need for any stopgap measures. The historical evidence suggests the TSC was previously calibrated late in the boot process due to support of x86 systems without a TSC (which we've long since stopped supporting -- all amd64 systems are required to have a TSC).

As part of moving the TSC calibration, infrastructure has been added to allow for easily adding support for methods of calibrating the TSC other than the PIT (e.g. #13961) in future changes (the code is written and ready for review for a couple of alternate calibration sources, but I wanted to get the infrastructure bits nailed down to avoid additional churn in those reviews). Some newer systems (e.g. the latest Intel NUCs) as well as some virtual environments (e.g. Hyper-V/Azure Gen2 VMs) do not include a functioning PIT.

This infrastructure allows one to declare an alternate calibration source w/ a numerical preference. The relative order of the preference determines the order in which the sources are to be tried. The calibration code is expected to detect if the currently running system is compatible and return if unsupported (to allow other sources to be tried). The PIT timer is intended to remain as a source of last resort, and should have the lowest preference.

As a part of the above infrastructure, an escape hatch is included that will allow an operator to explicitly specify a calibration source via two methods: either by entering set tsc_calibration=SOURCE at the loader prompt, or by adding set tsc_calibration="SOURCE" in /etc/system. The loader value takes precedence over the /etc/system value (if both are present). If an incorrect calibration source is entered with this escape hatch (e.g. a typo), a warning is written to the console, and all of the known calibration sources are tried as normal. As mentioned above, currently the PIT is the only defined calibration source, but that is intended to change soon.
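
The selection rules described above can be modeled with a short sketch. This is not the kernel implementation: the Source type, the function names, and the numeric preferences and frequencies are all hypothetical. One further assumption is that a valid override is tried first and the remaining sources are kept as fallbacks; the ticket does not say what happens if the forced source fails.

```python
from typing import Callable, NamedTuple, Optional


class Source(NamedTuple):
    name: str
    preference: int                          # higher values are tried first
    calibrate: Callable[[], Optional[int]]   # TSC Hz, or None if unsupported


def effective_override(loader: Optional[str], etc_system: Optional[str]) -> Optional[str]:
    """The loader setting wins over /etc/system when both are present."""
    return loader if loader is not None else etc_system


def calibrate_tsc(sources, override=None, warn=print):
    """Try sources in preference order, honoring an optional override."""
    ordered = sorted(sources, key=lambda s: s.preference, reverse=True)
    if override is not None:
        named = [s for s in ordered if s.name == override]
        if named:
            # Assumption: the forced source goes first, others remain as fallbacks.
            ordered = named + [s for s in ordered if s.name != override]
        else:
            warn(f"unknown TSC calibration source '{override}'; trying all sources")
    for src in ordered:
        hz = src.calibrate()
        if hz is not None:
            return src.name, hz
    raise RuntimeError("no usable TSC calibration source")


# Hypothetical sources; the PIT has the lowest preference (last resort).
SOURCES = [
    Source("pit", 100, lambda: 2_999_900_000),
    Source("hpet", 200, lambda: 3_000_000_000),
]

print(calibrate_tsc(SOURCES))                                      # default order
print(calibrate_tsc(SOURCES, effective_override("pit", "hpet")))   # loader wins
```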

Additionally, the TSC is now used to calibrate the APIC clock. Prior to this change, the TSC frequency was not established at the time the APIC is initialized (and as mentioned, the structure of the PSMs and such appears to have been designed to support x86 systems without a TSC -- something no longer relevant).

As an interim measure, when a calibration source other than the PIT is used, the infrastructure will still collect TSC and APIC timings using the PIT, but will only save the results for examination via kmdb and not otherwise use the values. This is intended for diagnostic/troubleshooting purposes.

To facilitate support of systems without a functioning PIT, we also include code to attempt to detect when a PIT is missing or non-functional. The observed behavior (to date) on all such systems is that the initial counter values written to the PIT registers never decrease, regardless of how long the system waits. The detection code sets an initial value and then waits long enough that, across the range of system speeds available today (or in the foreseeable future), a working PIT should be observed counting down (the comments go into detail on the value used). If a non-functioning PIT is found, the variable pit_is_broken is set, which prevents the system from collecting the above-mentioned diagnostic PIT values. pit_is_broken can also be set via /etc/system to always force this behavior -- this is in case other systems are encountered where accessing the PIT registers might cause system instability (i.e. a crash).
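
The detection idea above (write an initial count, wait, see whether it ever decreases) can be illustrated with a simulated device. Everything here is hypothetical: FakePit stands in for the real PIT registers, and the poll limit is an arbitrary stand-in for the wait interval that the real code's comments justify.

```python
class FakePit:
    """Simulated PIT counter; a broken one never counts down."""

    def __init__(self, working: bool):
        self.working = working
        self.count = 0

    def write(self, value: int) -> None:
        self.count = value

    def read(self) -> int:
        if self.working and self.count > 0:
            self.count -= 1
        return self.count


def pit_counts_down(pit, initial: int = 0xFFFF, max_polls: int = 1000) -> bool:
    """Return True if the counter is ever seen below its initial value."""
    pit.write(initial)
    for _ in range(max_polls):
        if pit.read() < initial:
            return True
    return False


# Mirrors the role of the pit_is_broken variable described above.
pit_is_broken = not pit_counts_down(FakePit(working=False))
print("pit_is_broken:", pit_is_broken)
```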

Actions #10

Updated by Jason King 3 months ago

With the above summary, to detail the testing:

The change was booted on both a physical machine (SmartOS w/ the change built), as well as a BE built with the change (OmniOS). Both systems booted successfully.

Additionally, to validate the APIC clock calibration, the TSC and PIT timings were compared as detailed above in the ticket. Within an expected amount of variance (fractions of a percent, w/ more variance when booting a VM on a busy host), both sources produced the same timings. This agrees with the analysis of the error propagation: calibrating the TSC with the PIT, and then the APIC with the TSC, does (in theory) produce a small amount of additional error, but that error is dwarfed by the measurement error of the PIT (i.e. it's barely even noise).

To validate the escape hatches, I had to build another image w/ an additional calibration source to provide an alternate source. In that instance, I used the code to calibrate using the HPET. Since the HPET test code had a higher preference than the PIT, I verified that the default behavior was to use the HPET. Then I tested overriding this default w/ the PIT via both the loader and /etc/system. I then also verified that the loader setting took precedence over /etc/system by specifying the PIT via the loader, and HPET via /etc/system (and then looking at the source used to calibrate).

Actions #11

Updated by Electric Monk 3 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 575694f6d7409fab127e6f2f326e13a764cc9a00

commit  575694f6d7409fab127e6f2f326e13a764cc9a00
Author: Jason King <jason.brian.king@gmail.com>
Date:   2021-08-04T04:09:04.000Z

    13354 illumos should calibrate the TSC earlier in the boot process.
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
    Approved by: Robert Mustacchi <rm@fingolfin.org>
