Feature #533 » 20100804_kuriakose.kuruvilla.txt

ARC case for AVX support - Roland Mainz, 2010-12-16 03:48 PM

 
Subject: Intel AVX Support [PSARC/2010/311 FastTrack timeout 08/11/2010]
To:      PSARC-ext@Sun.Com
Cc:      kuriakose.kuruvilla@oracle.com,lejun.zhu@intel.com,bart.smaalders@oracle.com,mara.roccaforte@oracle.com,robert.a.kasten@intel.com,edward.gillett@oracle.com,roger.faulkner@oracle.com,steve.chessin@oracle.com
Bcc:     one-pager-list@sac.sfbay one-pager-log@sac.sfbay sac-bar@sac.sfbay

I am sponsoring the following fast-track for Lejun Zhu and Kuriakose
Kuruvilla.  It requests a patch/micro binding.  Man pages with change
bars are available in the materials directory.  Timeout is set to
8/11/2010.

Thanks,
Sherry

Template Version: @(#)sac_nextcase 1.70 03/30/10 SMI
This information is Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
1. Introduction
   1.1. Project/Component Working Name:
        Support Intel Advanced Vector Extensions (AVX) in Solaris

   1.2. Name of Document Author/Supplier:
        Lejun Zhu
        Kuriakose Kuruvilla

   1.3. Date of This Document:
        Jul 14th, 2010

   1.4. Name of Major Document Customer(s)/Consumer(s):
        1.4.1. The Community you expect to review your project:
        1.4.2. The ARC(s) you expect to review your project:
                // Leave blank if you don't have any preference
                // This item is advisory only

   1.5. Email Aliases:
        1.5.2. Responsible Engineer:
                <lejun.zhu@intel.com>
                <kuriakose.kuruvilla@oracle.com>
        1.5.4. Interest List: intel-core@sun.com

    
2. Project Summary
   2.1. Project Description:
        Intel Advanced Vector Extensions (AVX) introduces new instructions
that accelerate vector floating point operations. AVX uses 256-bit
registers, which requires extending the current Solaris interfaces that
manipulate FPU registers, such as the signal stack layout, the "setcontext"
syscall and the /proc interface.

   2.2. Risks and Assumptions:
        When extending Solaris interfaces and/or data structures to
support AVX, it is very important to provide binary compatibility for
existing applications. All application binaries that exist today will
continue to run on the new Solaris kernel without having to be recompiled.
The only restriction for existing binaries is that they have enough space
on the signal stack to hold the extra state (see 4.1.2 for details).

    
3. Business Summary
   3.1. Problem Area:
        Intel AVX is a new 256-bit SIMD FP vector extension of Intel
Architecture. Its introduction is targeted for the next Intel
Microarchitecture (code-named: Sandy Bridge). Intel AVX accelerates the
trend toward FP intensive computation in general purpose applications
like image, video, and audio processing, engineering applications such as
3D modeling and analysis, scientific simulation, and financial analytics.

   3.2. Market/Requester:

   3.3. Business Justification:
        Customers who use Solaris x86 will expect to run optimized
applications on Sandy Bridge and future generations of Intel CPUs, and many
optimizations will use AVX instructions, such as Basic Linear Algebra
Subprograms (BLAS) with DGEMM Routine, or sequential and cluster FFTs.
Also, the amd64 ABI already supports YMM registers. The latest GCC can
generate AVX instructions, and an AVX-enabled Sun Studio compiler is being
developed. All of these will require kernel changes to support AVX.

   3.4. Competitive Analysis:
        Support for XSAVE and YMM has already been implemented in the Linux
kernel.

   3.5. Opportunity Window/Exposure:
        Intel will support AVX instructions in the next generation Intel
Microarchitecture (code-named: Sandy Bridge). Applications optimized for
Sandy Bridge will emerge soon. In order to enable these optimizations on
Solaris, we need to get the OS support into ON as soon as possible.

   3.6. How will you know when you are done?:
        Applications can run correctly and use YMM registers on Intel machines
that support AVX/YMM registers.

    
4. Technical Description:
    4.1. Details:
        4.1.1 Extending ucontext_t
            Structure ucontext_t will have the same size as its previous
version and all existing fields will be at the same byte offset, except
that part of its filler is used for the xregs extension. A new flag
UC_XREGS (0x10) for the uc_flags field will be added. Any ucontext_t with
this flag set is considered to have the new layout described in this PSARC
case. Any ucontext_t with this flag not set in its uc_flags is considered
to have the original layout and its uc_xrs field will be ignored.

            A data structure will be defined as follows for both 32-bit
and 64-bit applications:

            #define XRS_ID  0x00737278 /* the string "xrs" */

            typedef struct {
                unsigned long xrs_id;
                caddr_t xrs_ptr;
            } xrs_t;

            Field xrs_id must have the value XRS_ID (little endian), and
xrs_ptr will point to a prxregset_t data structure.
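
            The "little endian" remark about XRS_ID can be checked with a
minimal sketch: writing the constant out least significant byte first
yields the bytes 'x', 'r', 's' and a terminating zero. The helper name
xrs_id_to_string is hypothetical and is not part of the proposed interface.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XRS_ID 0x00737278u /* value from this case: the string "xrs" */

/*
 * Hypothetical helper (not part of the proposal): write the 32-bit id
 * into buf least significant byte first, i.e. in little-endian order.
 * The top byte of XRS_ID is 0, so the result is NUL-terminated.
 */
static void
xrs_id_to_string(uint32_t id, char buf[4])
{
	for (int i = 0; i < 4; i++)
		buf[i] = (char)((id >> (8 * i)) & 0xff);
}
```

Calling xrs_id_to_string(XRS_ID, buf) leaves the C string "xrs" in buf.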

    
            Part of uc_filler in the current ucontext_t definition will be
used to store xrs_t. The new definition of ucontext_t is:

            typedef struct  ucontext {
                unsigned long   uc_flags;
                ucontext_t      *uc_link;
                sigset_t        uc_sigmask;
                stack_t         uc_stack;
                mcontext_t      uc_mcontext;
                xrs_t           uc_xrs;
                long            uc_filler[3];
            } ucontext_t;

            For a 64-bit kernel to work with 32-bit applications, the
following definition will be used:

            typedef struct {
                uint32_t xrs_id;
                caddr32_t xrs_ptr;
            } xrs32_t;

            typedef struct ucontext32 {
                uint32_t        uc_flags;
                caddr32_t       uc_link;
                sigset32_t      uc_sigmask;
                stack32_t       uc_stack;
                mcontext32_t    uc_mcontext;
                xrs32_t         uc_xrs;
                int32_t         uc_filler[3];
            } ucontext32_t;

            Only the kernel components that are specified in this PSARC
case will use the extended form of ucontext_t. The rest of the kernel code
that uses ucontext_t, for example getcontext() calls, will remain
unchanged, which means flag UC_XREGS will always be cleared in uc_flags,
and uc_xrs will be filled with 0 (if it is a kernel created ucontext_t) in
these unchanged cases.

    
            There is an alternative way to store xrs_t in ucontext_t,
which is putting xrs_t into fpregset_t of mcontext_t. But there are no
trailing padding bytes in the amd64 definition of fpregset_t, so we
would have to use the software available bytes (defined in table "XSAVE
Save Area Layout for x87 FPU and SSE State" of Intel Software Developer's
Manual Volume 2B) and put xrs_t in the middle of fpregset_t (before
"status" and "xstatus"). In the i386 definition, the layout of fpregset_t
is different - there are no software available bytes, but there is
trailing space because fp_emul is larger than fpchip_state. Putting the
same field at different positions in the amd64 and i386 definitions would
make the C declaration less straightforward. Also, since XSAVE is designed
to be a generic mechanism capable of saving more than FPU state, putting
xrs_t in fpregset_t will look strange if we have non-FPU state in the
future. In the Solaris implementation, wherever we do a selective copy-in
of fpregset_t we would need to change the code to always copy in the xrs_t
in fpregset_t, which makes it even more confusing. So using uc_filler is
the better way to extend ucontext_t without changing the size of any
existing data structure.

    
            Data type prxregset_t is defined as:

            #define XR_TYPE_XSAVE  0x101

            typedef struct prxregset {
                uint32_t pr_type;
                uint32_t pr_align;
                uint32_t pr_xsize;
                uint32_t pr_pad;
                union {
                    struct pr_xsave {
                        uint16_t pr_fcw;
                        uint16_t pr_fsw;
                        uint16_t pr_fctw;
                        uint16_t pr_fop;
            #if defined(__amd64)
                        uint64_t pr_rip;
                        uint64_t pr_rdp;
            #else
                        uint32_t pr_eip;
                        uint16_t pr_cs;
                        uint16_t __pr_ign0;
                        uint32_t pr_dp;
                        uint16_t pr_ds;
                        uint16_t __pr_ign1;
            #endif
                        uint32_t pr_mxcsr;
                        uint32_t pr_mxcsr_mask;
                        union {
                            uint16_t pr_fpr_16[5];
                            u_longlong_t pr_fpr_mmx;
                            uint32_t __pr_fpr_pad[4];
                        } pr_st[8];
            #if defined(__amd64)
                        upad128_t pr_xmm[16];
                        upad128_t __pr_ign2[3];
            #else
                        upad128_t pr_xmm[8];
                        upad128_t __pr_ign2[11];
            #endif
                        union {
                            struct {
                                uint64_t pr_xcr0;
                                uint64_t pr_mbz[2];
                            } pr_xsave_info;
                            upad128_t __pr_pad[3];
                        } pr_sw_avail;
                        uint64_t pr_xstate_bv;
                        uint64_t pr_rsv_mbz[2];
                        uint64_t pr_reserved[5];
            #if defined(__amd64)
                        upad128_t pr_ymm[16];
            #else
                        upad128_t pr_ymm[8];
                        upad128_t __pr_ign3[8];
            #endif
                    } pr_xsave;
                } pr_un;
            } prxregset_t;

    
            Fields pr_type and pr_align are derived from the SPARC
prxregset_t definition. Field pr_type will have the value XR_TYPE_XSAVE,
indicating that this data structure is defined as in this PSARC case. Field
pr_align is currently unused and should be set to 0. The value of field
pr_xsize will be equal to the size of the union member selected by pr_type,
in this case sizeof (struct pr_xsave). pr_pad makes the layout of
prxregset_t identical under 32-bit and 64-bit compilers; its value is
ignored by the kernel and should be set to 0.

            Field pr_xsave is used to store XSAVE/AVX specific state. The
first 512 bytes have the FXSAVE layout (the same as the amd64
definition of fpregset_t, see also the FXSAVE instruction in Intel
Software Developer's Manual Volume 2A), followed by the 64-byte XSAVE
header and 256 bytes of YMM state. See table "General Layout of
XSAVE/XRSTOR Save Area" in Intel Software Developer's Manual Volume 2B for
the detailed meaning of each new field in the XSAVE layout. Field
pr_sw_avail represents the software available bytes defined in table
"XSAVE Save Area Layout for x87 FPU and SSE State" of Intel Software
Developer's Manual Volume 2B, and is used to store additional information.
Its field pr_xcr0 contains the value of XCR0 of the CPU when the state is
saved; the rest of the area should be set to 0.
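
            The sizes quoted above (512 + 64 + 256 bytes, after a 16-byte
header) can be cross-checked with a compile-time sketch. The structures
below are simplified stand-ins for the amd64 variant, not the proposed
declarations: the names xsave_amd64, xregset_amd64 and pad128_t are
assumptions made for illustration, and pad128_t only models the 16-byte
size of upad128_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for upad128_t: 16 bytes (assumption for this sketch only). */
typedef struct { uint64_t w[2]; } pad128_t;

/* Simplified mirror of the amd64 variant of struct pr_xsave. */
struct xsave_amd64 {
	uint16_t fcw, fsw, fctw, fop;
	uint64_t rip, rdp;
	uint32_t mxcsr, mxcsr_mask;
	pad128_t st[8];		/* x87/MMX registers, 16 bytes each */
	pad128_t xmm[16];
	pad128_t ign2[3];
	pad128_t sw_avail[3];	/* software available bytes (pr_xcr0 etc.) */
	uint64_t xstate_bv;	/* start of the 64-byte XSAVE header */
	uint64_t rsv_mbz[2];
	uint64_t reserved[5];
	pad128_t ymm[16];	/* upper halves of the YMM registers */
};

struct xregset_amd64 {
	uint32_t type, align, xsize, pad;	/* 16-byte header */
	struct xsave_amd64 xsave;
};

_Static_assert(offsetof(struct xsave_amd64, sw_avail) == 464, "sw bytes");
_Static_assert(offsetof(struct xsave_amd64, xstate_bv) == 512, "FXSAVE area");
_Static_assert(offsetof(struct xsave_amd64, ymm) == 576, "XSAVE header");
_Static_assert(sizeof (struct xsave_amd64) == 832, "pr_xsize");
_Static_assert(sizeof (struct xregset_amd64) == 848, "prxregset_t");
```

The final assertion matches the 848 bytes quoted for sizeof (prxregset_t)
in 4.1.2.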

    
            The YMM registers are always 256 bits in length for both
32-bit and 64-bit code. The lower part (bits 127-0) of each YMM register is
mapped onto the corresponding XMM register. pr_ymm only stores the upper
part (bits 255-128), and the lower part is stored in pr_xmm as before.
This is consistent with the XSAVE layout used by the CPU. All 16 YMM
registers are available in 64-bit code, but 32-bit code can only access
the first 8 YMM registers.
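
            A consumer that wants the full 256-bit value of a register
therefore has to join the two halves itself. The following is a minimal
sketch of that step on raw bytes; the function name ymm_compose is
hypothetical, and the byte arrays stand in for one pr_xmm[] entry and one
pr_ymm[] entry.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: build the full 256-bit value of one YMM register
 * from its two stored halves. Bytes are in memory (little-endian) order,
 * so the low half (bits 127-0, from pr_xmm) comes first.
 */
static void
ymm_compose(const uint8_t xmm_lo[16], const uint8_t ymm_hi[16],
    uint8_t out[32])
{
	memcpy(out, xmm_lo, 16);	/* bits 127-0 */
	memcpy(out + 16, ymm_hi, 16);	/* bits 255-128 */
}
```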

    
        4.1.1.1 Future extensibility

            The definition of prxregset_t is extendable. In case of a
future extension, for example adding a 512 byte state XYZ into the
context, the definition will be:

            typedef struct prxregset {
                uint32_t pr_type;
                uint32_t pr_align;
                uint32_t pr_xsize;
                uint32_t pr_pad;
                union {
                    struct pr_xsave {
                        uint16_t pr_fcw;
                        uint16_t pr_fsw;
                        uint16_t pr_fctw;
                        uint16_t pr_fop;
            #if defined(__amd64)
                        uint64_t pr_rip;
                        uint64_t pr_rdp;
            #else
                        uint32_t pr_eip;
                        uint16_t pr_cs;
                        uint16_t __pr_ign0;
                        uint32_t pr_dp;
                        uint16_t pr_ds;
                        uint16_t __pr_ign1;
            #endif
                        uint32_t pr_mxcsr;
                        uint32_t pr_mxcsr_mask;
                        union {
                            uint16_t pr_fpr_16[5];
                            u_longlong_t pr_fpr_mmx;
                            uint32_t __pr_fpr_pad[4];
                        } pr_st[8];
            #if defined(__amd64)
                        upad128_t pr_xmm[16];
                        upad128_t __pr_ign2[3];
            #else
                        upad128_t pr_xmm[8];
                        upad128_t __pr_ign2[11];
            #endif
                        union {
                            struct {
                                uint64_t pr_xcr0;
                                uint64_t pr_mbz[2];
                            } pr_xsave_info;
                            upad128_t __pr_pad[3];
                        } pr_sw_avail;
                        uint64_t pr_xstate_bv;
                        uint64_t pr_rsv_mbz[2];
                        uint64_t pr_reserved[5];
            #if defined(__amd64)
                        upad128_t pr_ymm[16];
            #else
                        upad128_t pr_ymm[8];
                        upad128_t __pr_ign3[8];
            #endif
                        uint8_t pr_xyz[512];
                    } pr_xsave;
                } pr_un;
            } prxregset_t;

    
            As a general rule, when extending prxregset_t as defined in
this PSARC case, all existing fields should be kept at the same byte
offset within prxregset_t, unless the value of pr_type is changed as well.

            The kernel will verify the integrity of data structure "pxr"
and convert an earlier version to the latest version using the following
pseudo code:

            prxregset_t *pxr; /* Possibly earlier version of xregs */
            prxregset_t kxr; /* Latest definition of xregs in kernel */
            /* FXSAVE + XSAVE header + YMM */
            size_t size_avx = 512 + 64 + 256;
            size_t size_xyz = size_avx + 512; /* AVX + XYZ */

            if (pxr->pr_type != XR_TYPE_XSAVE) {
                /* pxr is invalid */
            }

            if (pxr->pr_xsize < size_avx) {
                /* pxr is invalid */
            }

            if ((pxr->pr_un.pr_xsave.pr_xstate_bv & XFEATURE_XYZ) &&
                 pxr->pr_xsize < size_xyz) {
                /* pxr is invalid */
            }

            /* The 512-byte FXSAVE area is always copied. */
            bcopy(&pxr->pr_un.pr_xsave, &kxr.pr_un.pr_xsave, 512);

            /* Later areas are copied only if flagged in pr_xstate_bv. */
            if (pxr->pr_un.pr_xsave.pr_xstate_bv & XFEATURE_AVX) {
                bcopy(&pxr->pr_un.pr_xsave.pr_ymm,
                    &kxr.pr_un.pr_xsave.pr_ymm,
                    sizeof (kxr.pr_un.pr_xsave.pr_ymm));
            }

            if (pxr->pr_un.pr_xsave.pr_xstate_bv & XFEATURE_XYZ) {
                bcopy(&pxr->pr_un.pr_xsave.pr_xyz,
                    &kxr.pr_un.pr_xsave.pr_xyz,
                    sizeof (kxr.pr_un.pr_xsave.pr_xyz));
            }

            Applications are encouraged to use the value in pr_xsize to
work with future prxregset_t extensions. When pr_xyz is added,
applications that were developed before the XYZ extension can still work.
For example, they can copy a prxregset_t structure in memory to a file by
calculating the number of bytes to copy using pr_xsize.
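
            A minimal sketch of that size calculation follows. It assumes
that the total size is the 16-byte fixed header plus pr_xsize, which
matches the 848-byte figure in 4.1.2, and it applies the first two checks
from the pseudo code above. The names xregs_hdr and xregs_copy_size are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XR_TYPE_XSAVE	0x101
#define XREGS_HDR_SIZE	16u	/* pr_type + pr_align + pr_xsize + pr_pad */
#define SIZE_AVX	(512u + 64u + 256u)

/* Only the fixed header; the union payload follows it in memory. */
struct xregs_hdr {
	uint32_t pr_type;
	uint32_t pr_align;
	uint32_t pr_xsize;
	uint32_t pr_pad;
};

/*
 * Hypothetical helper: number of bytes to copy for a prxregset_t whose
 * header is *hdr, or 0 if the header fails the type or minimum size
 * check.
 */
static size_t
xregs_copy_size(const struct xregs_hdr *hdr)
{
	if (hdr->pr_type != XR_TYPE_XSAVE || hdr->pr_xsize < SIZE_AVX)
		return (0);
	return (XREGS_HDR_SIZE + (size_t)hdr->pr_xsize);
}
```

Because the result depends only on pr_xsize, the same code keeps working
if a later release appends the hypothetical pr_xyz area.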

    
        4.1.2 Signal stack
            An amd64 signal frame looks like this on the stack:
            old %rsp:
                    <128 bytes of untouched stack space>
                    <a siginfo_t [optional]>
                    <a prxregset_t [optional]> (added by this PSARC case)
                    <a ucontext_t>
                    <siginfo_t *>
                    <signal number>
            new %rsp:       <return address (deliberately invalid)>

            An i386 SVR4/ABI signal frame looks like this on the stack:
            old %esp:
                    <a siginfo32_t [optional]>
                    <a prxregset_t [optional]> (added by this PSARC case)
                    <a ucontext32_t>
                    <pointer to that ucontext32_t>
                    <pointer to that siginfo32_t>
                    <signo>
            new %esp:       <return address (deliberately invalid)>

            User space code will access siginfo_t and ucontext_t through
pointers, so the signature of the signal handler is not changed. This PSARC
case adds a prxregset_t to the signal frame if the system supports AVX.
The existence of prxregset_t can be determined from the uc_flags and
uc_xrs of ucontext_t.

            On AVX enabled systems, this extension will appear in the
signal frame of every application that has its FPU state enabled, even if
the application does not use AVX or YMM registers. As a result, some
additional space on the signal handler stack will be used
(sizeof (prxregset_t), which is 848 bytes for now).
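
            The existence check described above can be sketched as a small
predicate. This is only an illustration: struct xrs_view stands in for the
xrs_t member of ucontext_t, frame_has_xregs is a hypothetical name, and
treating a null xrs_ptr as "not present" is an assumption of this sketch
rather than something the case specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UC_XREGS	0x10		/* new uc_flags bit from 4.1.1 */
#define XRS_ID		0x00737278u	/* the string "xrs" */

/* Stand-in for the xrs_t member of ucontext_t (sketch only). */
struct xrs_view {
	unsigned long	xrs_id;
	const void	*xrs_ptr;
};

/*
 * Hypothetical helper: decide whether a signal frame's ucontext_t
 * carries a prxregset_t, given its uc_flags and uc_xrs values.
 */
static bool
frame_has_xregs(unsigned long uc_flags, const struct xrs_view *xrs)
{
	if ((uc_flags & UC_XREGS) == 0)
		return (false);	/* original layout: uc_xrs is ignored */
	return (xrs->xrs_id == XRS_ID && xrs->xrs_ptr != NULL);
}
```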

    
        4.1.3 Getsetcontext syscall
            Syscall "getsetcontext" (100) will be extended to deal with
the new ucontext_t. On AVX enabled machines, YMM is considered part of the
FPU state. If UC_XREGS and UC_FPU are found in uc_flags, and xrs_id and
pr_xsize are valid, SETCONTEXT will update the LWP's FPU state using the
content in prxregset_t.

            GETCONTEXT will not be extended, because all YMM registers are
caller saved. Compiler generated code or the assembly programmer should
restore YMM registers when the called function returns. When the
application tries to restore the context saved by GETCONTEXT, the
application will continue to execute from the next instruction after
setcontext() if UC_CPU is not set, or from the next instruction after the
previous getcontext() if UC_CPU is set. In both cases, the YMM registers
should be restored by compiler generated code or hand written assembly.
Therefore it is not necessary for GETCONTEXT to return YMM content.

            As a result, the kernel code branch to process the extended
SETCONTEXT will only be executed when UC_XREGS is set, which happens only
when SETCONTEXT is called at the end of the libc signal handling routine.
Normal calls of libc setcontext() from user applications do not have
UC_XREGS set, and SETCONTEXT will work the same way as before.

    
        4.1.4 /proc
            File /proc/<pid>/lwp/<id>/xregs will be used to support
read/write of extra state (XSTATE_BV and YMM for now) through the procfs
interface. The following function will be added on the x86 architecture:
            PCSXREG (procfs ioctl)

            The length of /proc/<pid>/lwp/<id>/xregs will be sizeof
(prxregset_t) on machines that support AVX, and 0 on other machines.

            On machines that support AVX, the content of
/proc/<pid>/lwp/<id>/xregs will be the same as "<a prxregset_t
[optional]>" that is placed on the signal stack.

            When using PCSXREG to set extra state, a user space application
must provide a prxregset_t that is valid under the integrity check and is
meaningful on the current machine. In this PSARC case, prxregset_t will be
considered invalid if the value of pr_type or pr_xsize fails the sanity
check defined in 4.1.1.1. Trying to set an invalid prxregset_t, or to set
YMM on a system that does not support AVX, will not change the state of the
target process. In such situations, EINVAL will be returned. This is
different from the behavior of the SPARC implementation, which does not
verify the content of prxregset_t.

            The value of pr_xcr0 is informational and should not be
modified when an application modifies state through procfs.

            The bit values in pr_xstate_bv indicate the corresponding area
of the FPU state that should be set (bit X = 1) or initialized (bit X =
0). When bit X is set to 0, values in the corresponding area will be
ignored and initial values will be set into the FPU instead. For the
meaning of each bit, see the operation section of the XRSTOR instruction
in Intel Software Developer's Manual Volume 2B.
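
            The set-or-initialize rule can be modelled for the YMM area as
follows. This is a sketch of the semantics only, not kernel code: the
function name load_ymm_area is hypothetical, the macro name XFEATURE_AVX is
borrowed from the pseudo code in this case, and its value assumes that AVX
state is bit 2 of XCR0/XSTATE_BV with an all-zero initial state.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* AVX state is bit 2 of XCR0/XSTATE_BV; the macro name is illustrative. */
#define XFEATURE_AVX	0x4ull

/*
 * Hypothetical model of the rule for the YMM area: if the AVX bit is
 * set, the supplied upper halves are loaded; if it is clear, they are
 * ignored and the target is put into its initial (all-zero) state.
 */
static void
load_ymm_area(uint64_t xstate_bv, const uint8_t src[256], uint8_t dst[256])
{
	if (xstate_bv & XFEATURE_AVX)
		memcpy(dst, src, 256);
	else
		memset(dst, 0, 256);
}
```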

    
        4.1.4.1 Future extensibility of procfs
            Considering the general rules to extend prxregset_t in
4.1.1.1, it is safe for applications that are developed today to
read/write xregs on a future Solaris version. For example, an application
that reads xregs, updates ymm0 and writes it back can do the following:

            prxregset_t *pxr;
            struct pr_xsave *pxs;
            size_t len;
            /* FXSAVE + XSAVE header + YMM */
            size_t size_avx = 512 + 64 + 256;

            len = get_file_size("/proc/123/lwp/1/xregs");
            if (len < size_avx) {
                //The system does not have xregs extension with
                //AVX state. Stop.
            }

            pxr = (prxregset_t *)malloc(len);
            read_entire_file("/proc/123/lwp/1/xregs", pxr);

            // Sanity check.
            if (pxr->pr_type != XR_TYPE_XSAVE) {
                //Not the xregs type we want, stop.
            }

            pxs = &pxr->pr_un.pr_xsave;

            if ((pxs->pr_sw_avail.pr_xsave_info.pr_xcr0 &
                XFEATURE_AVX) == 0) {
                //This system does not have AVX. Stop.
            }

            if ((pxs->pr_xstate_bv & XFEATURE_AVX) == 0) {
                //YMM is in initial state, clean and set.
                memset(pxs->pr_ymm,
                    0, sizeof (pxs->pr_ymm));
                pxs->pr_xstate_bv |= XFEATURE_AVX;
            }

            //Update pxs->pr_ymm[0]
            ioctl_set_xregs("/proc/123/lwp/1/xregs", pxr);

    
        4.1.5 Core dump format
            We already have prxregset_t as part of the core dump file (see
core(4)). On x86 systems that have the xregs extension, e.g. systems that
have enabled the AVX extension, the core dump will include note sections
with prxregset_t as described in the manpage.

            To support dumping xregs using gcore(1), libproc needs to be
extended by making the currently SPARC specific APIs available in the x86
definition as well. Because libproc is only used privately by tools such
as dtrace and gcore, this will not cause compatibility issues.

        4.1.6 mdb(1)
            mdb(1) will support disassembling all the new AVX
instructions, as well as XSAVE, XRSTOR, XGETBV and XSETBV. Also, mdb(1)
will be able to process YMM values as part of the FPU state in the same
way as we have for XMM today. On platforms that support AVX, mdb(1) "print
floating point registers" commands ($x and $y) will print the %ymm value
for each %xmm that is printed. An example of mdb output is:

            > $x
            _fp_hw 0x03 (80387 chip with SSE)
            < ...omitted >

            %xmm0  0x5f4d4d585f4d4d585f4d4d585f4d4d58
            %xmm1  0x00000000000000000000000000000000
            %xmm2  0x00000000000000000000000000000000
            %xmm3  0x00000000000000000000000000000000
            %xmm4  0x00000000000000000000000000000000
            %xmm5  0x00000000000000000000000000000000
            %xmm6  0x00000000000000000000000000000000
            %xmm7  0x00000000000000000000000000000000
            %ymm0  0x5f4d4d595f4d4d595f4d4d595f4d4d595f4d4d585f4d4d585f4d4d585f4d4d58
            %ymm1  0x0000000000000000000000000000000000000000000000000000000000000000
            %ymm2  0x0000000000000000000000000000000000000000000000000000000000000000
            %ymm3  0x0000000000000000000000000000000000000000000000000000000000000000
            %ymm4  0x0000000000000000000000000000000000000000000000000000000000000000
            %ymm5  0x0000000000000000000000000000000000000000000000000000000000000000
            %ymm6  0x0000000000000000000000000000000000000000000000000000000000000000
            %ymm7  0x0000000000000000000000000000000000000000000000000000000000000000

    
        4.1.7 Hardware Capabilities
            Two new hardware capability bits, AV_386_XSAVE (0x10000000)
and AV_386_AVX (0x20000000), are added. Applications that need to be aware
of XSAVE and AVX can test the hardware capabilities on the current system
to see if it supports these features.
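
            Such a test reduces to a mask check on the capability word. The
sketch below uses the two bit values proposed in this section; the helper
name hwcap_has_avx is hypothetical, and how the caller obtains the word
(for example through getisax(3C) on Solaris) is left outside the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AV_386_XSAVE	0x10000000u	/* values proposed in this case */
#define AV_386_AVX	0x20000000u

/*
 * Hypothetical helper: true if the capability word reports both XSAVE
 * and AVX. Requiring XSAVE as well is a choice made for this sketch.
 */
static bool
hwcap_has_avx(uint32_t hwcap)
{
	const uint32_t need = AV_386_XSAVE | AV_386_AVX;

	return ((hwcap & need) == need);
}
```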

    
        4.1.8 Linux Brand
            Solaris 10 has an Lx Brand that supports an earlier version of
the Linux kernel, which doesn't have AVX or XSAVE support. In Nevada, it
has been removed by PSARC/2010/169. So, nothing needs to be changed in the
Linux Brand for now. However, changes will be required in the Lx Brand in
the future if we upgrade the Linux kernel to a version that supports AVX.

    4.2. Bug/RFE Number(s):
        6714685 Need to support Intel Advanced Vector Extensions (AVX)

        Also the following CRs are related to this PSARC:
        6958308 XSAVE/XRSTOR mechanism to save and restore processor state
        6970220 Replace use of XSAVE with XSAVEOPT instruction for optimized context saves

    4.3. In Scope:
        Kernel changes necessary to support the use of AVX instructions
and YMM registers in user space.

    4.4. Out of Scope:
        Debuggers and tools other than mdb(1) to manipulate YMM.

    
    4.5. Interfaces:

        Interface                               Stability
        ---------                               ---------

Data structure:
        ucontext_t / ucontext32_t               Evolving
        xrs_t / xrs32_t                         Evolving
        prxregset_t                             Evolving

Procfs ioctl:
        PCSXREG                                 Evolving

User space API:
    proc_service:
        ps_lgetxregsize                         Evolving
        ps_lgetxregs                            Evolving
        ps_lsetxregs                            Evolving
    thread_db:
        td_thr_getxregsize                      Evolving
        td_thr_getxregs                         Evolving
        td_thr_setxregs                         Evolving
    libproc:
        Plwp_getxregs                           Evolving
        Plwp_setxregs                           Evolving

    4.6. Doc Impact:
        As xregs is introduced in the x86 architecture as well, the
        following man pages need to be updated. All changed man pages can
        be found in the attachment. Modified versions of these man pages
        have change bars.

        ps_lgetregs(3PROC)
        proc_service(3PROC)
        td_thr_getgregs(3C_DB)

    
    4.7. Admin/Config Impact:
        N/A

    4.8. HA Impact:
        N/A

    4.9. I18N/L10N Impact:
        N/A

    4.10. Packaging & Delivery:
        N/A

    4.11. Security Impact:
        We need to prevent YMM state from being "leaked" between processes,
because these registers may contain sensitive information. Currently we
always set or initialize all FPU state (legacy FP, XMM and YMM) during
context switch when using XRSTOR. In signal stack handling, we keep YMM
values untouched if UC_XREGS is not set in ucontext_t. This is the same as
how we handle the rest of the FPU state today if UC_FPU is not set.

    4.12. Dependencies:
        N/A

5. Reference Documents:
        Intel Advanced Vector Extensions Programming Reference
        Document #319433, www.intel.com
        Chapter 3 System Programming Model: OS requirement to support AVX.

        Intel 64 and IA-32 Architectures Software Developer's Manual
        Document #253667, www.intel.com
        XSAVE layout.

        System V Application Binary Interface
        AMD64 Architecture Processor Supplement
        Draft Version 0.99, www.x86-64.org
        Section 3.2: ABI requirement for YMM.

    
6. Resources and Schedule:
   6.1. Projected Availability:
        3Q '10

   6.2. Cost of Effort:
        The implementation and unit testing will take 1 engineer and 3
months. Also it will take 1 engineer and 3 months for integration and back
port.

   6.4. Product Approval Committee requested information:
        6.4.1. Consolidation or Component Name: ON (OS/Net)
        6.4.7. Target RTI Date/Release:
                OpenSolaris build 147
        6.4.8. Target Code Design Review Date:

   6.5. ARC review type:
        Fast track

   6.6. ARC Exposure: open
       6.6.1. Rationale: Part of OpenSolaris

7. Prototype Availability:
   7.1. Prototype Availability:
        Prototype currently available

   7.2. Prototype Cost:
        2 person-weeks required to verify implementation

    
669

    
670

    
671
6. Resources and Schedule
672
    6.4. Steering Committee requested information
673
   	6.4.1. Consolidation C-team Name:
674
		ON
675
    6.5. ARC review type: FastTrack
676
    6.6. ARC Exposure: open