Bug #1439

Panic when booting on IBM SystemX

Added by Jens Rosenboom over 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: kernel
Target version: -
Difficulty: Medium
Start date: 2011-09-01
Due date:
% Done: 100%
Spent time: -
Tags:

Description

Right after loading the kernel, there is a panic "vmem_hash_delete: bad free" called from fm_smb_fmacompat; see the attached screenshot. The code in uts/intel/os/fmsmb.c:562 seems to do more allocs than frees anyway. Can someone please clean this up?

Bildschirmfoto_2011-08-31_um_15.11.05.png - Screenshot (26.7 kB) Jens Rosenboom, 09/01/2011 09:04 am

sm.patch (1.2 kB) Jens Rosenboom, 09/05/2011 07:05 pm

sm.patch2 (1.2 kB) Jens Rosenboom, 09/05/2011 07:33 pm

issue_00_similar_panic_messages_and_stack.jpg - same error on console (212.1 kB) Jon Strabala, 01/04/2012 01:47 am

issue_UEFI__1_13_also_fails.jpg - it happens on UEFI 1.13 (103.7 kB) Jon Strabala, 01/04/2012 01:47 am

issue_work_around.jpg - up and running on an oi_151a image using a kmdb trick (105.8 kB) Jon Strabala, 01/04/2012 02:21 am

History

Updated by Jens Rosenboom over 2 years ago

Some more findings:

The problem only occurs with recent UEFI versions, 1.11 and 1.12 for my x3650M3. After a downgrade to 1.10, the system runs without a problem.

After inserting a "goto bad" right at the start of fm_smb_fmacompat, the machine also works fine with the 1.12 version.

Looking at the smbios data, I can even guess where the crash comes from: The final type 11 record doesn't have a string attached, leaving cnt=0 when the "for i" loop finishes and thus giving bad parameters for the kmem_free call.
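The guess above can be illustrated with a small userland sketch. This is not the code from fmsmb.c and not the attached patch: the record type `oem_record_t`, the helpers `sketch_alloc`/`sketch_free` (stand-ins for size-carrying kernel calls such as kmem_alloc/kmem_free), and the function `scan_oem_records` are all hypothetical names chosen for illustration. It only shows the general idea of skipping a record whose string count is zero, so that the free call never sees a zero size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one SMBIOS type 11 record. */
typedef struct oem_record {
	size_t nstrings;	/* strings attached to the record; may be 0 */
} oem_record_t;

/*
 * Userland stand-ins for a size-carrying allocator (assumption: they
 * mimic the "free must receive the size that was allocated" contract).
 */
static void *
sketch_alloc(size_t size)
{
	assert(size > 0);
	return (malloc(size));
}

static void
sketch_free(void *buf, size_t size)
{
	/* A zero size or NULL buffer here models the "bad free" case. */
	assert(buf != NULL && size > 0);
	free(buf);
}

/*
 * Walk the records and allocate/free a per-record scratch array.
 * Records without strings are skipped before any alloc or free.
 * Returns the number of records that were skipped.
 */
size_t
scan_oem_records(const oem_record_t *recs, size_t nrecs)
{
	size_t skipped = 0;

	for (size_t i = 0; i < nrecs; i++) {
		size_t cnt = recs[i].nstrings;

		if (cnt == 0) {
			/* Nothing was allocated, so nothing may be freed. */
			skipped++;
			continue;
		}

		size_t sz = cnt * sizeof (int);
		int *ids = sketch_alloc(sz);
		if (ids == NULL)
			continue;

		for (size_t j = 0; j < cnt; j++)
			ids[j] = (int)j;	/* placeholder work */

		sketch_free(ids, sz);	/* same size as the allocation */
	}

	return (skipped);
}
```

With three records carrying 2, 1 and 0 strings, the last record is skipped instead of reaching the free call with a zero count.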

Updated by Jens Rosenboom over 2 years ago

I made a patch for this; can someone please check it? It works for me, but no guarantees otherwise. It would be good to test it on a board that has this SUNW-PRMS-1 string, too.

Updated by Jens Rosenboom over 2 years ago

New version, thanks to richlowe for spotting the mistake

Updated by Jon Strabala over 2 years ago

Hit the same issue on a brand new IBM x3550 M3 but on a newer BIOS, e.g. UEFI 1.13 (build date 9/23/2011) when doing a fresh install.

Since I can not install oi_151a, I can not apply the patch. Looks like I need to make my own ISO (sort of a pain) until there is a new ISO release (hopefully with the patch in it).

Updated by Jon Strabala over 2 years ago

Looking at the file ./usr/src/uts/intel/os/fmsmb.c, it occurred to me that x86gentopo_legacy can be set in /etc/system, which led me to ask further questions in the #illumos IRC channel. With help from Daemar, I determined that I could indeed boot from grub with -kd added, to drop into a kmdb prompt early on, and then type the following:

x86gentopo_legacy /W 1

to disable the path that causes the 'panic' followed immediately by

:c 

to leave kmdb and continue execution. This got me to an installer! I have attached an image of the system up and running.

I can keep using the kmdb trick until I have an /etc/system with the x86gentopo_legacy setting in place, and eventually build and boot into a new BE that includes the patch described above (where neither the kmdb trick nor a modified /etc/system is needed).

Updated by Jon Strabala over 2 years ago

As discussed the oi_151a live CD installs and runs on my IBM x3550 M3 with UEFI (BIOS) rev 1.13 in the default oi_151a BE (openindiana) as long as I have the following setting in my /etc/system:

set x86gentopo_legacy=1

Once I built and booted into a new BE (nightly-2012-01-04-uefi) with the above patch applied, e.g. https://www.illumos.org/issues/1439#note-3 , I could then remove the above setting from my /etc/system file and the system still worked just fine.

root@systemx:~# beadm list
BE                      Active Mountpoint Space Policy Created
nightly-2012-01-04-uefi NR     /          9.00G static 2012-01-04 17:03
openindiana             -      -          58.8M static 2012-01-03 19:06

Thus, for what it's worth, the patch works fine on another system (albeit another IBM System X).

Updated by Jon Strabala over 2 years ago

I can apply patch https://www.illumos.org/issues/1439#note-3 to both an IBM SystemX (above) and a non-IBM system (e.g. Supermicro X9SCA-F, BE nightly-2012-01-17-uefi); in both cases the resulting BE seems to work just fine.

I do not have other machines available to test. Is there anything else I can do to help get this patch integrated?

Updated by Rich Lowe about 2 years ago

  • Category set to kernel
  • Status changed from New to Resolved
  • % Done changed from 0 to 100
  • Tags deleted (needs-triage)

Resolved in r13613 8abd7b12d92f
