pg info lacks enough detail for dispatcher to decide what to do
When many people (myself included) think of sockets/cores/threads, they are thinking about the way Intel does things. AMD processors are a bit different. While Intel cores are totally separate, AMD cores aren't. AMD cores have separate integer execution units, but they share floating point execution units. There are no threads (in the Intel sense) on AMD chips. (AMD cores may also share L1-I cache.) Sorry for using Intel vs. AMD; this is about microarchitectures, not vendors. It's just very convenient (although not entirely accurate) to talk about these differences as Intel vs. AMD.
For example, on a 4-core Intel chip with HyperThreading, we'll get (chipid/coreid/cacheid):
::walk cpu | ::print cpu_t cpu_physid | ::printf "%d/%d/%d\n" "struct cpu_physid" cpu_chipid cpu_coreid cpu_cacheid
0/0/2
0/0/2
0/1/2
0/1/2
0/2/2
0/2/2
0/3/2
0/3/2
While on an 8-core AMD chip we'll see:
0/0/0
0/1/1
0/2/2
0/3/3
0/4/4
0/5/5
0/6/6
0/7/7
So, the issue Joyent (an Intel shop) saw (#4247) was that kthreads were migrating all over the chip (between all threads/cores - Intel speak) because the check compared cacheid, which is the same for every CPU on the chip. They changed it to compare coreid, which treats only migrations between HyperThreads of the same core as cost-free. This sounds good, but...
On AMD chips, checking against coreid doesn't make sense since each integer pipeline gets its own coreid. Migrating to a different core (AMD speak) may or may not be cost-free.
The good news is that Joyent's change will not change the behavior on AMD systems. The bad news is that it's a bit of a hack. "Cost-free" is supposed to be about caches, not about sharing of execution units. A proper fix will involve some updates to the pg infrastructure to better represent the different levels of cache affinity.
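To make the difference between the two checks concrete, here's a minimal sketch. It is not the actual dispatcher code: physid_t, same_cache() and same_core() are hypothetical names, and the struct only models the three IDs printed by the mdb pipeline above.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for struct cpu_physid; only the three IDs
 * shown in the mdb output are modeled.
 */
typedef struct physid {
	int chipid;
	int coreid;
	int cacheid;
} physid_t;

/* Old check: a migration is "cost-free" if both CPUs have the same cacheid. */
static bool
same_cache(const physid_t *a, const physid_t *b)
{
	return (a->cacheid == b->cacheid);
}

/* Check after #4247: "cost-free" only if both CPUs have the same coreid. */
static bool
same_core(const physid_t *a, const physid_t *b)
{
	return (a->coreid == b->coreid);
}
```

Plugging in the sample IDs: on the Intel chip, 0/0/2 vs. 0/1/2 passes same_cache() but fails same_core(), so the change stops cross-core migrations from looking free. On the AMD chip, 0/0/0 vs. 0/1/1 fails both checks, which is why the behavior there doesn't change.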
Looking at pg_cmt_cpu_init() which is responsible for filling out the chip/core/cache IDs, it iterates through the hw data. When it finds a cache (PGHW_CACHE), it stashes it away for later use. After it finishes iterating, it'll set the cache ID to the stashed pg's id. This makes sense so far. This is what happens on the Intel example in my previous comment. Looking at the pginfo for that system:
$ pginfo
PG  RELATIONSHIP              CPUs
0   System                    0-7
3   Socket                    0-7
2   Cache                     0-7
1   Integer_Pipeline          0 1
4   CPU_PM_Idle_Power_Domain  0 1
5   Integer_Pipeline          2 3
6   CPU_PM_Idle_Power_Domain  2 3
7   Integer_Pipeline          4 5
8   CPU_PM_Idle_Power_Domain  4 5
9   Integer_Pipeline          6 7
10  CPU_PM_Idle_Power_Domain  6 7
This explains why the cacheid fields are all 2 (a strange number otherwise). Since it is an i7-2720QM, which has 4x 32k L1I, 4x 32k L1D, 4x 256k L2, and 1x 6MB L3, pg id 2 has to be referring to the L3: it is the only cache shared by all eight CPUs. For the purpose of thread migration, we want to be looking at either L1 or L2. (Well, we'd want to check L1 first, then L2, then L3, etc.) Neither of these caches is expressed in the pg data.
For comparison, here's pginfo for the AMD system:
$ pginfo
PG  RELATIONSHIP                CPUs
0   System                      0-7
2   Socket                      0-7
3   CPU_PM_Active_Power_Domain  0-7
1   Floating_Point_Unit         0 1
4   Floating_Point_Unit         2 3
5   Floating_Point_Unit         4 5
6   Floating_Point_Unit         6 7
Since the pg data for this system doesn't include any caches, each cpu_t's cacheid in cpu_physid will be initialized to the cpu id (see pghw_physid_create) and never changed afterwards. This explains why the cacheid matches the coreid.
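Here's a rough sketch of the cache ID logic as I described it above. This is not the real pg_cmt_cpu_init(): hw_kind_t, hw_pg_t and cache_id_for_cpu() are hypothetical, and the real code deals with more hardware types than are modeled here.

```c
#include <stddef.h>

/* Hypothetical, simplified set of hardware sharing relationships. */
typedef enum hw_kind {
	HW_IPIPE,
	HW_CACHE,
	HW_FPU,
	HW_CHIP
} hw_kind_t;

/* Hypothetical record for one pg that a CPU belongs to. */
typedef struct hw_pg {
	hw_kind_t kind;
	int pg_id;
} hw_pg_t;

/*
 * Walk the pgs a CPU belongs to and stash any cache pg. After the walk,
 * the cache ID is the stashed pg's id. If no cache pg was seen, the ID
 * stays at its initial value, the cpu id. If several cache pgs appear,
 * this sketch keeps the last one seen, which is an arbitrary choice.
 */
static int
cache_id_for_cpu(int cpu_id, const hw_pg_t *pgs, size_t npgs)
{
	const hw_pg_t *cache = NULL;

	for (size_t i = 0; i < npgs; i++) {
		if (pgs[i].kind == HW_CACHE)
			cache = &pgs[i];
	}
	return (cache != NULL ? cache->pg_id : cpu_id);
}
```

With the Intel pginfo above, every CPU sees Cache pg 2, so the result is always 2. With the AMD pginfo there is no cache pg at all, so each CPU falls back to its own cpu id.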
Hans suggested changing the definition of what a cache means to be L1. Currently it is the last level cache:
PGHW_CACHE, /* Cache (generally last level) */
Since AMD systems don't include cache info in their pg data, changing the meaning of PGHW_CACHE (from last level to L1) won't accomplish anything on AMD systems.
The way I see it, there are two problems that need addressing:
- The pg info doesn't contain enough information about the hierarchy of caches (Intel chips seem to fill in only L3, AMD chips don't fill in anything).
- The dispatcher makes a boolean decision ("is it cost-free to migrate") about a pair of CPUs. Rather, it should try to find the best available pair. In other words, it is better to migrate a thread between two different cores (Intel speak) on the same chip than between cores on two different chips.
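To illustrate the second point, here's a minimal sketch of ranking migration targets instead of making a yes/no decision. Everything in it is hypothetical (cpu_ids_t, mig_cost_t, migration_cost(), best_target()), and it only uses the three IDs that exist today, so it is still subject to the first problem.

```c
/* Hypothetical copy of the three IDs kept per CPU. */
typedef struct cpu_ids {
	int chipid;
	int coreid;
	int cacheid;
} cpu_ids_t;

/* Lower values mean a cheaper migration. */
typedef enum mig_cost {
	MIG_SAME_CORE = 0,
	MIG_SAME_CACHE,
	MIG_SAME_CHIP,
	MIG_CROSS_CHIP
} mig_cost_t;

/* Grade a migration from src to dst instead of answering yes/no. */
static mig_cost_t
migration_cost(const cpu_ids_t *src, const cpu_ids_t *dst)
{
	if (src->chipid != dst->chipid)
		return (MIG_CROSS_CHIP);
	if (src->coreid == dst->coreid)
		return (MIG_SAME_CORE);
	if (src->cacheid == dst->cacheid)
		return (MIG_SAME_CACHE);
	return (MIG_SAME_CHIP);
}

/*
 * Return the index of the cheapest candidate, or -1 if there are none.
 * Ties go to the earliest candidate.
 */
static int
best_target(const cpu_ids_t *src, const cpu_ids_t *cands, int ncands)
{
	int best = -1;

	for (int i = 0; i < ncands; i++) {
		if (best == -1 || migration_cost(src, &cands[i]) <
		    migration_cost(src, &cands[best]))
			best = i;
	}
	return (best);
}
```

On the Intel sample, every CPU on the chip has cacheid 2, so the sketch cannot tell L1/L2 sharing apart from L3 sharing. Getting a more useful ranking would need the extra cache levels in the pg data.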