Bug #9596


Initial xsave xstate_bv should not include all features

Added by Robert Mustacchi almost 3 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Category:
kernel
Start date:
2018-06-13
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

This issue is related to 9595 but ultimately is pernicious enough
that it deserves its own bug. The xstate_bv vector in the xsave_state is
used to indicate which portions of the xsave state are valid. When the
processor issues an xsave instruction, it looks at the current processor
state and writes out the intersection of that and the requested, enabled
features. For each state it writes out, it will then set the
corresponding bit in the xstate_bv part of the xsave structure, if that
state has been modified from its initial value.

When restoring, the processor does a similar thing: it restores the
intersection of the requested and enabled features with the values that
are in the xstate_bv vector. As it restores them, the processor tracks
these and sets the corresponding bits in the XINUSE and XMODIFIED
registers, which are used for future optimizations of what to save out.
For example, the xsaveopt instruction will not save out requested
sections that are in their default state.

Now, the reason this is important is that the %xmm registers shadow the
%ymm registers, which in turn shadow the %zmm registers. This is most
pronounced with the %zmm registers. The fundamental challenge is that
the legacy, non-VEX-prefixed %xmm instructions leave the upper bits of
the register untouched, which means the processor must silently save and
restore them. This causes the performance degradation seen and noted in
OS-6917.

Today in the OS, we initialize the xstate_bv vector as part of creating
a new lwp, and we basically set it to the value of everything enabled in
xcr0. This means that, despite the fact that the ymm and zmm state is
all in its initial state, when we switch into this new thread we'll end
up indicating to the processor that these registers are not in their
default state.

The solution is to go through and change these to match what we actually
initialize per the ABI. Specifically, the initial AVX state initializes
the traditional x87 and SSE state, and only the portions of the xsave
state that correspond to the required headers, such as the xstate_bv
member. Everything else is always zeroed, which the processor considers
its default state.

As an example, without this change, we can see the same processor incur
roughly a 3x overhead when processing AESNI instructions that don't use
the VEX prefix. This first case is without the fix:

# /usr/bin/amd64/openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 66871442 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 17677513 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 4473955 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1119170 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 140369 aes-128-cbc's in 3.00s
OpenSSL 1.0.2n  7 Dec 2017
built on: date not available
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) blowfish(idx) 
compiler: information not available
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     356647.69k   377120.28k   381777.49k   382010.03k   383300.95k

The second is with it:

# openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 160210563 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 56611505 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 14496761 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 3640241 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 455682 aes-128-cbc's in 3.00s
OpenSSL 1.0.2n  7 Dec 2017
built on: date not available
options:bn(64,32) md2(int) rc4(8x,mmx) des(ptr,risc1,16,long) aes(partial) blowfish(idx) 
compiler: information not available
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     854456.34k  1207712.11k  1237056.94k  1242535.59k  1244315.65k

Testing this consisted of three different phases:

1. Correctness
2. Performance
3. Regression

Correctness

To test the first phase, I used a platform with the new ld and the demo
audit libraries with 64-bit applications to verify that they executed
properly. I did this with both the default fp-save-based routine and
then used mdb to change the method back to the fallback position
(setting fp_elf to -1). I did this on the following systems:

  • 1u SMCI Haswell (Intel(r) Xeon(r) CPU E3-1270 v3 @ 3.50GHz) (ymm/xsaveopt)
  • 1u SMCI Ivy Bridge (Intel(r) Xeon(r) CPU E3-1270 V2 @ 3.50GHz) (ymm/xsaveopt)
  • 2u SMCI Skylake (Intel(r) Xeon(r) Gold 6132 CPU @ 2.60GHz) (zmm/xsaveopt)
  • Westmere system from Patrick (xmm/fxsave)
  • VMware VM: Intel(r) Core(tm) i7-3635QM CPU @ 2.40GHz (avx/xsave)

To further verify the audit library and PLT transition logic, I manually
went through and snapshotted the register state at the start and return
and made sure that they were equivalent. Because mdb cannot currently
see the ymm / zmm state, I instead went through and, in a few cases,
manually checked the xsave state that we had saved in the kernel.

Finally, on all the systems other than Patrick's Westmere system, I
cycled back and tested all of this on debug as well.

Performance

The next phase was performance testing to make sure the set of changes
resulted in concrete improvements when using non-VEX encoded
instructions that manipulate the %xmm registers. To test this, I
compared and contrasted two different approaches. The first is looking
at openssl's builtin speed benchmarks for AESNI, which uses the %xmm
registers. For this, I focused on Skylake systems. For example, on a
system without these changes we saw this performing as:

# /usr/bin/amd64/openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 66871442 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 17677513 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 4473955 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1119170 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 140369 aes-128-cbc's in 3.00s
OpenSSL 1.0.2n  7 Dec 2017
built on: date not available
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) blowfish(idx) 
compiler: information not available
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     356647.69k   377120.28k   381777.49k   382010.03k   383300.95k

Now, on that same system we instead see:

# /usr/bin/amd64/openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 202879648 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 56029413 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 14459222 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 3639867 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 455709 aes-128-cbc's in 3.00s
OpenSSL 1.0.2n  7 Dec 2017
built on: date not available
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) blowfish(idx) 
compiler: information not available
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc    1082024.79k  1195294.14k  1233853.61k  1242407.94k  1244389.38k

This is practically a 3x improvement (1082024.79k / 356647.69k ≈ 3.03x
at 16-byte blocks). We also went through and tested a KVM
configuration. Here, on a different Skylake system with the 6140 CPUs,
we tested MKL in a Windows VM with 4 vCPUs. Note that these Windows VMs
are not given the ability to see xsave or any AVX-related instructions.

In a run on older bits, we see the following from the mkl benchmark with
the default options and threads:

Sample data file lininput_xeon64.

Current date/time: Thu Jun 07 11:59:50 2018

CPU frequency:    2.586 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 4

Parameters are set to:

Number of tests: 15

Number of equations to solve (problem size) : 1000  2000  5000  10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array                  : 1000  2000  5008  10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run                     : 4     2     2     2     2     2     2     2     2     2     1     1     1     1     1    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     1     1     1     1    
Maximum memory requested that can be used=16200901024, at the size=45000

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
1000   1000   4      0.084      7.9568   9.419757e-13 3.212379e-02   pass
1000   1000   4      0.063      10.6705  9.419757e-13 3.212379e-02   pass
1000   1000   4      0.068      9.8954   9.419757e-13 3.212379e-02   pass
1000   1000   4      0.063      10.6429  9.419757e-13 3.212379e-02   pass
2000   2000   4      0.457      11.6955  4.657913e-12 4.051814e-02   pass
2000   2000   4      0.456      11.7025  4.657913e-12 4.051814e-02   pass
5000   5008   4      7.240      11.5176  3.111936e-11 4.339344e-02   pass
5000   5008   4      7.167      11.6345  3.111936e-11 4.339344e-02   pass
10000  10000  4      53.559     12.4511  9.915883e-11 3.496441e-02   pass
10000  10000  4      53.591     12.4437  9.915883e-11 3.496441e-02   pass

On the same system with newer bits we see the following:

Sample data file lininput_xeon64.

Current date/time: Thu Jun 07 10:06:31 2018

CPU frequency:    2.982 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 4

Parameters are set to:

Number of tests: 15

Number of equations to solve (problem size) : 1000  2000  5000  10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array                  : 1000  2000  5008  10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run                     : 4     2     2     2     2     2     2     2     2     2     1     1     1     1     1    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     1     1     1     1    
Maximum memory requested that can be used=16200901024, at the size=45000

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
1000   1000   4      0.029      22.8182  9.419757e-13 3.212379e-02   pass
1000   1000   4      0.019      35.3216  9.419757e-13 3.212379e-02   pass
1000   1000   4      0.019      34.8625  9.419757e-13 3.212379e-02   pass
1000   1000   4      0.019      34.3931  9.419757e-13 3.212379e-02   pass
2000   2000   4      0.142      37.6467  4.657913e-12 4.051814e-02   pass
2000   2000   4      0.129      41.2824  4.657913e-12 4.051814e-02   pass
5000   5008   4      2.029      41.0859  3.111936e-11 4.339344e-02   pass
5000   5008   4      1.994      41.8239  3.111936e-11 4.339344e-02   pass
10000  10000  4      15.217     43.8247  9.915883e-11 3.496441e-02   pass
10000  10000  4      14.917     44.7046  9.915883e-11 3.496441e-02   pass
15000  15000  4      49.292     45.6557  2.298756e-10 3.620579e-02   pass
15000  15000  4      49.909     45.0914  2.298756e-10 3.620579e-02   pass
18000  18008  4      84.708     45.9066  2.800032e-10 3.066379e-02   pass
18000  18008  4      84.675     45.9243  2.800032e-10 3.066379e-02   pass
20000  20016  4      116.272    45.8763  4.214754e-10 3.730981e-02   pass
20000  20016  4      116.430    45.8141  4.214754e-10 3.730981e-02   pass
22000  22008  4      156.026    45.5028  4.296590e-10 3.147083e-02   pass
22000  22008  4      154.228    46.0335  4.296590e-10 3.147083e-02   pass
25000  25000  4      226.157    46.0649  5.659749e-10 3.218496e-02   pass

There are some major differences here. It's worth calling out that
turbo boost was likely still enabled on these systems, allowing them to
change their clocks while executing, which makes this data imperfect.
However, while turbo boost certainly contributes to the approximately 3x
difference we've observed, it is my belief that it is not responsible
for the magnitude of the change, given that the CPU is running at most
400-500 MHz faster. That is consistent with how we believe the CPU is
operating.

Regression Testing

Due to the changes to the PLT code and the default xsave state, I was
worried about two different sources of regressions: first, that the cost
of the xsave in the PLT would be more expensive, and second, that we
would somehow impact the ability to use the ymm/zmm registers with the
changes there. To deal with this I took a few different approaches.

To measure performance on an older Westmere system, Patrick kindly
compared a full illumos build before and after and we found that the
timing differences were mostly in the noise. However, it's worth
pointing out that we only changed the 32-bit PLT, though the xsave
changes are valid for both types of processes. In addition, Patrick also
booted bhyve VMs with the new PLT on the Westmere systems.

In a VMware VM, I booted a 64-bit lx image, cloned libressl, built it
64-bit, and ran through their make check, which was intended to be a
reasonable exercise of the FPU. I also compared the runtime of this
before and after: I made sure the repository was built and ran make
check, then rebooted the system and ran make check a number of times on
platforms with and without this change. The data was in the noise.

Similarly, on an SKX system, I used a 64-bit SmartOS zone and compared
the execution of a single threaded compile and subsequent make checks.
These tests are noisy, but were roughly equivalent, with the newer bits
being somewhat faster.

Both KVM and bhyve VMs were tested: bhyve by Patrick on Westmere, and
KVM on Skylake. There should not be much impact to them except insofar
as they are affected by these other problems.

#1

Updated by Electric Monk almost 3 years ago

  • Status changed from New to Closed

git commit d0158222a5936ac26b7a03241b7d2df18cc544c8

commit  d0158222a5936ac26b7a03241b7d2df18cc544c8
Author: Robert Mustacchi <rm@joyent.com>
Date:   2018-06-19T19:34:37.000Z

    9596 Initial xsave xstate_bv should not include all features
    9595 rtld should conditionally save AVX-512 state
    Reviewed by: John Levon <john.levon@joyent.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Bryan Cantrill <bryan@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
