Bug #4281

nhm and nhmex are dangerous

Added by Garrett D'Amore over 6 years ago. Updated over 6 years ago.

Status:
New
Priority:
High
Category:
driver - device drivers
Start date:
2013-10-31
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:

Description

Both the intel_nhm and intel_nhmex drivers crash on some platforms. In particular, I have an Ivy Bridge-based platform whose PCI configuration space lives in a different region than is typically found, and which lacks ACPI.

On this system, both intel_nhm and intel_nhmex crash during nhm_init().

The reason is that the code in nhm_init() blithely starts doing PCI configuration space accesses (to memory-mapped regions) that do not work on this platform:

int
nhm_init(void)
{
	int slot;

	/* return ENOTSUP if there is no PCI config space support. */
	if (pci_getl_func == NULL)
		return (ENOTSUP);

	for (slot = 0; slot < MAX_CPU_NODES; slot++) {
		nhm_chipset = CPU_ID_RD(slot);
		if (nhm_chipset == NHM_EP_CPU || nhm_chipset == NHM_WS_CPU ||
		    nhm_chipset == NHM_JF_CPU || nhm_chipset == NHM_WM_CPU)
			break;
	}
	if (slot == MAX_CPU_NODES) {
		return (ENOTSUP);
	}
	mem_reg_init();
	return (0);
}

Now, it turns out that we probably need to put some limit on which spaces are addressable via the memory-mapped accesses, since the addresses the driver uses are not accessible here. But the code above is just flat out broken: it makes no attempt to validate that the CPU is actually a Nehalem before going and hitting configuration space.

intel_nhmex is closed source, and probably should just be killed. It does mean that we'd lose memory-based self-healing on those platforms, but without source it's unclear how we could even meaningfully diagnose any problems there.

intel_nhm should use something -- perhaps the CPUID instruction -- to verify that it is running on a candidate CPU before blindly accessing configuration space.
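As a rough illustration of that kind of check (not the driver's actual code), the sketch below decodes the family/model from CPUID leaf 1 using GCC/Clang's <cpuid.h> wrapper. The function names and the exact list of Nehalem/Westmere family-6 model numbers (0x1a, 0x1e, 0x1f, 0x25, 0x2c, 0x2e, 0x2f) are assumptions for illustration; a kernel driver would use its own cpuid access rather than __get_cpuid().

```c
#include <cpuid.h>	/* __get_cpuid(): GCC/Clang CPUID wrapper */
#include <stdint.h>

/*
 * Decode family/model from a CPUID leaf-1 EAX value.
 * For family 6, the extended model bits (EAX[19:16]) must be
 * folded in above the base model (EAX[7:4]).
 */
int
is_nehalem_family(uint32_t eax)
{
	uint32_t family = (eax >> 8) & 0xf;
	uint32_t model = (eax >> 4) & 0xf;

	if (family != 0x6)
		return (0);
	model |= ((eax >> 16) & 0xf) << 4;

	switch (model) {
	case 0x1a:	/* Nehalem-EP (assumed model list) */
	case 0x1e:	/* Lynnfield */
	case 0x1f:
	case 0x25:	/* Westmere */
	case 0x2c:
	case 0x2e:	/* Nehalem-EX */
	case 0x2f:	/* Westmere-EX */
		return (1);
	default:
		return (0);
	}
}

/* Query the running CPU; returns 1 only on a candidate CPU. */
int
cpu_is_nehalem(void)
{
	uint32_t eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return (0);
	return (is_nehalem_family(eax));
}
```

With a gate like this at the top of nhm_init(), the driver would bail out with ENOTSUP on non-Nehalem hardware before ever touching configuration space.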

History

#1

Updated by Rich Lowe over 6 years ago

This is the same issue I hit on Athlon64 -- there, intel_nhmex entirely buggers the APIC.

A fix for that bug is (in the open driver) easier: you'll notice that we bind these drivers to specific devices; they aren't pseudo devices. Nevertheless, we run this crap ultimately from _init(!). In the driver we can modify, rearranging things such that they only run after we decide we're going to attach would at least limit them to prodding the hardware they expect to bind to (which may not really fix your bug, but would fix mine at least...)
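The rearrangement above can be sketched in miniature. This is a hypothetical userland model, not the driver's code: nhm_probe_chipset() stands in for the CPU_ID_RD() loop, and the hw_touched flag stands in for any access to configuration space. In the real driver, _init(9E) would do only mod_install(), and the probe would move into attach(9E), which runs only for devices the driver actually binds to.

```c
#include <assert.h>

int hw_touched = 0;		/* records when config space is first read */

/* Stand-in for the CPU_ID_RD() probe loop in nhm_init(). */
int
nhm_probe_chipset(void)
{
	hw_touched = 1;		/* real code would read config space here */
	return (0);		/* 0: supported chipset found */
}

/* Modelled _init(9E): register the module only; touch no hardware. */
int
nhm_mod_init(void)
{
	return (0);		/* real code: mod_install(&modlinkage) */
}

/*
 * Modelled attach(9E): we only get here for a device the driver
 * binds to, so probing the hardware is now safe.
 */
int
nhm_attach(void)
{
	if (nhm_probe_chipset() != 0)
		return (-1);	/* DDI_FAILURE in real code */
	return (0);		/* DDI_SUCCESS */
}
```

The point of the split is that module load (which happens on any machine) becomes side-effect free, while hardware access is deferred until the framework has matched the driver to a device it claims to support.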
